doe300 / VC4C

Compiler for the VC4CL OpenCL implementation
MIT License

Concatenate `basic_block`s #71

Closed nomaddo closed 6 years ago

nomaddo commented 6 years ago

Concatenating basic blocks enables other optimizations, because many optimizations only operate within a single basic block.

// label: start_of_function
...
// label: tmp.0
...
// label: tmp.1
...
br tmp.1
// label: end_of_function
...
br start_of_function

In this case, the basic blocks start_of_function and tmp.0 can be fused, because no jump instruction targets tmp.0 and the block following start_of_function is always tmp.0.
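The condition above can be sketched on a toy model. This is only an illustration, not VC4C code: the `Block` class, the `br <label>` text syntax, the function names and the `keep` parameter (a set of labels that must never be merged away) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy basic block: a label plus a flat list of instruction strings."""
    label: str
    instructions: list = field(default_factory=list)


def branch_target(instruction):
    """Return the label of a 'br <label>' instruction, or None otherwise."""
    parts = instruction.split()
    if len(parts) == 2 and parts[0] == "br":
        return parts[1]
    return None


def merge_fallthrough_blocks(blocks, keep=()):
    """Fuse a block into its predecessor when it can only be reached by
    falling through from that predecessor.

    Conservative: a predecessor that ends in any branch is never extended.
    """
    # Every label that some branch in the function jumps to.
    targets = {
        target
        for block in blocks
        for target in map(branch_target, block.instructions)
        if target is not None
    }
    merged = []
    for block in blocks:
        previous = merged[-1] if merged else None
        ends_in_branch = (
            previous is not None
            and bool(previous.instructions)
            and branch_target(previous.instructions[-1]) is not None
        )
        can_merge = (
            previous is not None
            and not ends_in_branch
            and block.label not in targets
            and block.label not in keep
        )
        if can_merge:
            previous.instructions.extend(block.instructions)
        else:
            merged.append(Block(block.label, list(block.instructions)))
    return merged


# The layout from the example above, with placeholder instructions.
blocks = [
    Block("start_of_function", ["op_a"]),
    Block("tmp.0", ["op_b"]),
    Block("tmp.1", ["op_c", "br tmp.1"]),
    Block("end_of_function", ["op_d", "br start_of_function"]),
]
result = merge_fallthrough_blocks(blocks, keep={"end_of_function"})
print([b.label for b in result])
# prints ['start_of_function', 'tmp.1', 'end_of_function']
```

Here tmp.0 is absorbed into start_of_function, while tmp.1 stays separate because a branch targets it.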

doe300 commented 6 years ago

This sounds like a good idea, especially because it almost always combines 2 to 4 blocks (start_of_function with first label and most of the time also last label with end_of_function)

So the rules to check should be something like:

If we extend the checks to also include unconditional branches (instead of only fall-through), we would need to re-order the basic blocks and possibly insert new branches to keep the structure correct. This could pave the way for a general optimization step that reorders basic blocks to keep the number of jumps minimal...
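To make the re-ordering concern concrete, here is a rough sketch on a toy model, not VC4C's implementation. It assumes blocks are `(label, instructions)` pairs in layout order, that `br <label>` is an unconditional branch and `br.cond <label>` a conditional one appearing only at the end of a block; these names and the function `merge_one_jump_target` are hypothetical.

```python
def branch_target(instruction):
    """Return the label of 'br <label>' or 'br.cond <label>', else None."""
    parts = instruction.split()
    if len(parts) == 2 and parts[0] in ("br", "br.cond"):
        return parts[1]
    return None


def ends_unconditionally(instructions):
    """True if the last instruction is an unconditional 'br'."""
    return bool(instructions) and instructions[-1].startswith("br ")


def merge_one_jump_target(blocks):
    """Merge one block B into the block A that jumps to it, provided A's
    unconditional branch is the only way to reach B.

    Returns a new list, or the input unchanged if no merge applies.
    """
    labels = [label for label, _ in blocks]
    for a_index, (a_label, a_instrs) in enumerate(blocks):
        if not ends_unconditionally(a_instrs):
            continue
        b_label = branch_target(a_instrs[-1])
        if b_label == a_label or b_label not in labels:
            continue
        b_index = labels.index(b_label)
        if b_index == 0:
            continue  # the entry block is reached implicitly
        b_instrs = blocks[b_index][1]
        # Count every branch (conditional or not) that targets B.
        jumps = sum(branch_target(ins) == b_label
                    for _, instrs in blocks for ins in instrs)
        # B's layout predecessor must not fall through into B.
        fallthrough_into_b = not ends_unconditionally(blocks[b_index - 1][1])
        if jumps != 1 or fallthrough_into_b:
            continue
        body = a_instrs[:-1] + b_instrs  # drop A's 'br B', append B
        if not ends_unconditionally(b_instrs) and b_index != a_index + 1:
            if b_index + 1 == len(blocks):
                continue  # B ends the function; moving it would change that
            # B used to fall through to its successor: make that explicit.
            body.append("br " + labels[b_index + 1])
        return [(label, body if i == a_index else instrs)
                for i, (label, instrs) in enumerate(blocks) if i != b_index]
    return blocks


blocks = [
    ("entry", ["op_a", "br.cond side"]),
    ("main", ["op_b", "br tail"]),
    ("side", ["op_c", "br done"]),
    ("tail", ["op_d"]),
    ("done", ["op_e"]),
]
reordered = merge_one_jump_target(blocks)
for label, instrs in reordered:
    print(label, instrs)
```

In this example tail is moved into main, and because tail previously fell through into done, a new `br done` has to be appended to the merged block. That inserted branch is the cost the paragraph above refers to, so the number of jumps does not always shrink.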

Anyway, this should wait until after #70 is done.

nomaddo commented 6 years ago

You don't concatenate %start_of_function and %end_of_function? In this case they can be concatenated, and doing so would affect scheduling, which fills the delay slots.

$ ./build/VC4C --asm -o /tmp/hoge testing/bugs/35_benchmark.cl 
...
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'sum' with 45 instructions, offset 2, with following parameters: __global const float* a (4 B, 1 items), __global const float* b (4 B, 1 items), __global out float* c (4 B, 1 items) (lSize, lids, gidX, offX)
// label: %start_of_function
or r2, unif, unif
or r1, unif, unif
or r3, unif, unif
or ra0, unif, unif
or ra3, unif, unif
or ra2, unif, unif
or ra1, unif, unif
ldi r0, 255
and r2, r2, r0
and r1, r1, r0
mul24 r0, r3, r2
add r0, ra0, r0
add ra0, r0, r1
nop.never 
or r1, ra0, ra0
shl r0, r1, 2 (2)
add tmu0s, ra3, r0
nop.load_tmu0.never 
or r3, r4, r4
shl r0, r1, 2 (2)
add tmu0s, ra2, r0
nop.load_tmu0.never 
fadd r0, r3, r4
fmul r2, r0, r0
fmul r1, r3, 0.500000 (47)
or r0, ra0, ra0
shl r0, r0, 2 (2)
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
fmul vpm, r1, r2
ldi vpw_setup, vdw_setup(rows: 1, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
add vpw_addr, ra1, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -41 // to %start_of_function
nop.never 
nop.never 
nop.never 
not irq, qpu_num
nop.thrend.never 
nop.never 
nop.never

doe300 commented 6 years ago

Yeah, merging %end_of_function has the problem that the step inserting the thread-end instructions no longer knows where to insert them. So for now, I skip this.

nomaddo commented 6 years ago

Thanks. I also want to add an optimization option to enable and disable it :)