doe300 / VC4C

Compiler for the VC4CL OpenCL implementation
MIT License
118 stars 37 forks source link

Remove redundant instructions for checking for-loop-exit-condition #22

Open nomaddo opened 6 years ago

nomaddo commented 6 years ago

The for-loop creates redundant instructions in the instructions of checking exit condition.

ex)

kernel void loop1 (global float a[], global float b[]) {
  for (int i = 0; i < 1000; i++) {
    a[i] = -i;
    b[i] = -i;
  }
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 67 instructions, offset 2, with following parameters: __global out float* a (4 B, 1 items), __global out float* b (4 B, 1 items)
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or -, unif, unif
or ra0, unif, unif
or r2, unif, unif
or r1, 0 (0), 0 (0)
nop.never 
or r0, r1, r1
sub r0, 0 (0), r0
itof r3, r0; v8min r0, r1, r1
shl r0, r0, 2 (2)
add r0, ra0, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or r0, r1, r1
or r1, r1, r1
shl r0, r0, 2 (2)
add r1, r1, 1 (1)
add r0, r2, r0
ldi ra1, 1000
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or -, mutex_acq, mutex_acq
ldi vpw_setup, 6656
or vpm, r3, r3
ldi vpw_setup, 2155954176
ldi vpw_setup, 3221291008
or vpw_addr, r0, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
xor.setf -, r1, ra1                                             /// redundant ?
xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1)           /// 
or.setf -, r0, r0                                               ///
or.ifz r1, r1, r1                                               ///
or.setf -, elem_num, r0                                         ///
brr.ifallzc (pc+4) + 4
nop.never 
nop.never 
nop.never 
brr.ifanyz (pc+4) + -40
nop.never 
nop.never 
nop.never 
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -63
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never 

To me, above instructions (specifying ///) can be expressed as isub.setf -, r1, ra1. Then, if the condition is negative, jump to the head of loop. Otherwise go through.

doe300 commented 6 years ago

xor.setf -, r1, ra1 /// redundant ? xor.ifzc r0, 1 (1), 1 (1); v8min.ifz r0, 1 (1), 1 (1) ///

These two lines calculate the condition (bool r0 = i == 1000). In this particular case, we don't really need the value of the condition stored in any register, the correct flags set would be enough.

or.setf -, r0, r0 /// or.ifz r1, r1, r1 ///

These two lines are generated by the phi-node for i (something like setting the i of the next iteration to the i of the previous one). or.ifz r1, r1, r1 is of course redundant, with or without using flags, which also makes the previous instruction or.setf -, r0, r0 unnecessary.

or.setf -, elem_num, r0 ///

This instruction is required by the subsequent conditional branches to set the flags correctly. This could be removed, if the last instruction before setting flags already sets the flags for the same source value (and for all SIMD elements!!), which is the case here.

To sum it up: You are right, with additional optimizations, this could be reduced to something like:

xor.setf -, r1, ra1 xor.ifzc.setf -, 1 (1), 1 (1); v8min.ifz.setf -, 1 (1), 1 (1) brr.ifallzc (pc+4) + 4 ...

nomaddo commented 6 years ago

The output of clang is as follows:

; Function Attrs: nounwind
define spir_kernel void @loop1(float addrspace(1)* nocapture %a, float addrspace(1)* nocapture %b) #1 {
entry:
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %i.09 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %sub = sub nsw i32 0, %i.09
  %conv = sitofp i32 %sub to float
  %arrayidx = getelementptr inbounds float addrspace(1)* %a, i32 %i.09
  store float %conv, float addrspace(1)* %arrayidx, align 4, !tbaa !10
  %arrayidx3 = getelementptr inbounds float addrspace(1)* %b, i32 %i.09
  store float %conv, float addrspace(1)* %arrayidx3, align 4, !tbaa !10
  %inc = add nuw nsw i32 %i.09, 1
  %exitcond = icmp eq i32 %inc, 1000
  br i1 %exitcond, label %for.end, label %for.body

for.end:                                          ; preds = %for.body
  ret void
}

One to one translation is a little bit hard to do such optimization. But if we can handle icmp eq ... and br ... in the same time, better optimization can be implemented in translation phase to vc4.

We have two choice:

Peephole optimization might be general (it can be used for other bit-operation optimization), but latter is easier.

nomaddo commented 6 years ago

One more thing: to jump neightbor basic_block, this cod output jump instruction. brr.ifallzc (pc+4) + 4 can be deleted in this case.

doe300 commented 6 years ago

You are right, with additional optimizations, this could be reduced to something like:

xor.setf -, r1, ra1 xor.ifzc.setf -, 1 (1), 1 (1); v8min.ifz.setf -, 1 (1), 1 (1) brr.ifallzc (pc+4) + 4 ...

Scratch that. According to here, we cannot set flags on both ALUs depending on the condition of their execution, thus xor.ifzc.setf -, 1 (1), 1 (1); v8min.ifz.setf -, 1 (1), 1 (1) won't work. Also the or.ifz r1, r1, r1 is original a i32 %18 = i32 %17 and therefore the same issue as discussed in #14.

nomaddo commented 6 years ago

@doe300 you already done it? If so, let's close the issue.

doe300 commented 6 years ago

No, I haven't found a fitting optimization yet.