Open nomaddo opened 6 years ago
xor.setf -, r1, ra1 /// redundant ? xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1) ///
These two lines calculate the condition (bool r0 = i == 1000
). In this particular case, we don't really need the value of the condition stored in any register, the correct flags set would be enough.
or.setf -, r0, r0 /// or.ifz r1, r1, r1 ///
These two lines are generated by the phi-node for i
(something like setting the i of the next iteration to the i of the previous one
). or.ifz r1, r1, r1
is of course redundant, with or without using flags, which also makes the previous instruction or.setf -, r0, r0
unnecessary.
or.setf -, elem_num, r0 ///
This instruction is required by the subsequent conditional branches to set the flags correctly. This could be removed, if the last instruction before setting flags already sets the flags for the same source value (and for all SIMD elements!!), which is the case here.
To sum it up: You are right, with additional optimizations, this could be reduced to something like:
xor.setf -, r1, ra1 xor.ifzc.setf -, 1 (1), 1 (1); v8min.ifz.setf -, 1 (1), 1 (1) brr.ifallzc (pc+4) + 4 ...
The output of clang
is as follows:
; Function Attrs: nounwind
define spir_kernel void @loop1(float addrspace(1)* nocapture %a, float addrspace(1)* nocapture %b) #1 {
entry:
br label %for.body
for.body: ; preds = %for.body, %entry
%i.09 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
%sub = sub nsw i32 0, %i.09
%conv = sitofp i32 %sub to float
%arrayidx = getelementptr inbounds float addrspace(1)* %a, i32 %i.09
store float %conv, float addrspace(1)* %arrayidx, align 4, !tbaa !10
%arrayidx3 = getelementptr inbounds float addrspace(1)* %b, i32 %i.09
store float %conv, float addrspace(1)* %arrayidx3, align 4, !tbaa !10
%inc = add nuw nsw i32 %i.09, 1
%exitcond = icmp eq i32 %inc, 1000
br i1 %exitcond, label %for.end, label %for.body
for.end: ; preds = %for.body
ret void
}
One to one translation is a little bit hard to do such optimization.
But if we can handle icmp eq ...
and br ...
in the same time, better optimization can be implemented in translation phase to vc4.
We have two choice:
Peephole optimization might be general (it can be used for other bit-operation optimization), but latter is easier.
One more thing: to jump neightbor basic_block, this cod output jump instruction.
brr.ifallzc (pc+4) + 4
can be deleted in this case.
You are right, with additional optimizations, this could be reduced to something like:
xor.setf -, r1, ra1 xor.ifzc.setf -, 1 (1), 1 (1); v8min.ifz.setf -, 1 (1), 1 (1) brr.ifallzc (pc+4) + 4 ...
Scratch that. According to here, we cannot set flags on both ALUs depending on the condition of their execution, thus xor.ifzc.setf -, 1 (1), 1 (1); v8min.ifz.setf -, 1 (1), 1 (1)
won't work. Also the or.ifz r1, r1, r1
is original a i32 %18 = i32 %17
and therefore the same issue as discussed in #14.
@doe300 you already done it? If so, let's close the issue.
No, I haven't found a fitting optimization yet.
The for-loop creates redundant instructions in the instructions of checking exit condition.
ex)
To me, above instructions (specifying
///
) can be expressed asisub.setf -, r1, ra1
. Then, if the condition is negative, jump to the head of loop. Otherwise go through.