This sounds like a good idea, especially because it almost always combines 2 to 4 blocks (start_of_function with the first label and, most of the time, also the last label with end_of_function).
So the rules to check should be something like: the second block is not the target of any branch (its only predecessor is the block directly before it), and the first block simply falls through into it (does not end with an explicit branch).
If we extend the checks to also include unconditional branches (instead of only fall-through), we would need to reorder the basic blocks and possibly insert new branches to keep the structure correct. This could lead down the path to a general optimization step that reorders basic blocks to keep the number of jumps minimal...
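A minimal sketch of the fall-through check (all names here - BasicBlock, canMerge, the member fields - are hypothetical illustrations, not VC4C's actual API):

#include <vector>

// Hypothetical CFG node, not VC4C's actual data structure.
struct BasicBlock
{
    std::vector<BasicBlock*> predecessors; // every block that can transfer control here
    BasicBlock* fallThroughSuccessor;      // the block that textually follows this one
    bool endsWithBranch;                   // true if the last instruction is an explicit branch
};

// B can be folded into A when control can only ever reach B by falling
// out of A, so removing B's label cannot break any branch target.
bool canMerge(const BasicBlock& a, const BasicBlock& b)
{
    return !a.endsWithBranch               // A falls through...
        && a.fallThroughSuccessor == &b    // ...into B
        && b.predecessors.size() == 1;     // and nothing else jumps to B
}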
Anyway, this should wait until after #70 is done.
You don't concatenate %start_of_function and %end_of_function?
In this case, we can concatenate them, and it will have an effect on scheduling, which fills delay slots.
$ ./build/VC4C --asm -o /tmp/hoge testing/bugs/35_benchmark.cl
...
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'sum' with 45 instructions, offset 2, with following parameters: __global const float* a (4 B, 1 items), __global const float* b (4 B, 1 items), __global out float* c (4 B, 1 items) (lSize, lids, gidX, offX)
// label: %start_of_function
or r2, unif, unif
or r1, unif, unif
or r3, unif, unif
or ra0, unif, unif
or ra3, unif, unif
or ra2, unif, unif
or ra1, unif, unif
ldi r0, 255
and r2, r2, r0
and r1, r1, r0
mul24 r0, r3, r2
add r0, ra0, r0
add ra0, r0, r1
nop.never
or r1, ra0, ra0
shl r0, r1, 2 (2)
add tmu0s, ra3, r0
nop.load_tmu0.never
or r3, r4, r4
shl r0, r1, 2 (2)
add tmu0s, ra2, r0
nop.load_tmu0.never
fadd r0, r3, r4
fmul r2, r0, r0
fmul r1, r3, 0.500000 (47)
or r0, ra0, ra0
shl r0, r0, 2 (2)
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
fmul vpm, r1, r2
ldi vpw_setup, vdw_setup(rows: 1, elements: 1 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
add vpw_addr, ra1, r0
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
// label: %end_of_function
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -41 // to %start_of_function
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
Yeah, merging %end_of_function has the problem that the insertion of the thread-end instructions no longer knows where to insert them. So for now, I skip this.
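For illustration, a sketch of why that lookup breaks, assuming the thread-end pass anchors on the label (the scan and all names are hypothetical, not the actual VC4C code):

#include <list>
#include <stdexcept>
#include <string>

struct Instruction
{
    std::string label; // non-empty only for label pseudo-instructions
};

// The thread-end sequence (host interrupt, thrend, two nops - visible at
// the end of the listing above) is inserted in front of %end_of_function.
// If that block was merged into its predecessor, the label no longer
// exists and the pass has no anchor left.
std::list<Instruction>::iterator findThreadEndPoint(std::list<Instruction>& code)
{
    for(auto it = code.begin(); it != code.end(); ++it)
        if(it->label == "%end_of_function")
            return it;
    throw std::runtime_error("no %end_of_function - block was merged away");
}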
Thanks. And I want to add optimization options to enable and disable it :)
Concatenation of basic_blocks promotes other optimizations, because a lot of optimizations work only within a single basic block. In this case, the basic blocks start_of_function and tmp.0 can be fused, because tmp.0 is not targeted by any other jump instruction and the next basic_block after start_of_function is always tmp.0.
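A rough sketch of the fusion step itself, under the same hypothetical data structures as the check above (not the actual VC4C implementation):

#include <list>
#include <string>

struct Instruction
{
    std::string text;
};

struct BasicBlock
{
    std::string label;
    std::list<Instruction> instructions;
};

// Since tmp.0 is only ever reached by falling out of start_of_function,
// its label can be dropped and its instructions appended, giving the
// intra-block optimizations one larger block to work on.
void mergeInto(BasicBlock& pred, BasicBlock& succ)
{
    pred.instructions.splice(pred.instructions.end(), succ.instructions);
    // succ is now empty; the caller erases it from the function's block list.
}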