doe300 / VC4C

Compiler for the VC4CL OpenCL implementation
MIT License
118 stars 37 forks source link

Remove redundant instructions #47

Closed nomaddo closed 6 years ago

nomaddo commented 6 years ago

Discussed in #45.

__kernel void loop1 (__global float * a) {
  int id = get_local_id(0);
  float16 v = vload16 (id, a);
  vstore16(v * 2, id, a);
}
// Module with 1 kernels, global data with 0 words (64-bit each), starting at offset 1 words and 0 words of stack-frame
// Kernel 'loop1' with 37 instructions, offset 2, with following parameters: __global float* a (4 B, 1 items) (lids)
or r0, unif, unif
or ra0, unif, unif
or r0, r0, r0
ldi r1, 255         // 
and r0, r0, r1      // Redundant 
and r0, r0, r1      // 
shl r0, r0, 4 (4)
shl r0, r0, 2 (2)
add r0, ra0, r0
or r2, r0, r0
mul24 r1, 4 (4), elem_num
v8adds r0, 8 (8), 8 (8)
sub.setf -, elem_num, r0
or.ifnc r0, 0 (0), 0 (0)
add.ifn r0, r2, r1
or tmu0s, r0, r0
nop.load_tmu0.never
or r0, 2.000000 (33), 2.000000 (33)
fmul r0, r4, r0
or -, mutex_acq, mutex_acq
ldi vpw_setup, vpm_setup(size: 16 words, stride: 1 rows, address: h32(0))
or vpm, r0, r0
ldi vpw_setup, vdw_setup(rows: 1, elements: 16 words, address: h32(0))
ldi vpw_setup, vdw_setup(stride: 0)
or vpw_addr, r2, r2
or -, vpw_wait, vpw_wait
or mutex_rel, 1 (1), 1 (1)
or r0, unif, unif
or.setf -, elem_num, r0
brr.ifallzc (pc+4) + -33
nop.never
nop.never
nop.never
not irq, qpu_num
nop.thrend.never
nop.never
nop.never
nomaddo commented 6 years ago

Enhancement of General Purpose Optimization

General purpose optimizations are like:

Steps

  1. Implement DataFlow Graph: Attach information of

    • which instructions are affected by a given value (with consideration of re-definitons)
    • which instruction define a given value
  2. Implement General Porpose Optimization phases, which repeat the same optimizations until instructions are not changed (fix point). Ideally, it should be repeated like:

    • Translation from LLVM-IR to intermediate::instructions
    • Regulation
    • General Porpose Optimizations
    • Special (meaning adapt once) optimization
    • General Porpose Optimizations (repeat)
    • Loop optimizations
    • General Porpose Optimizations (repeat)
    • Code generation (scheduling, register allocation, etc...)
  3. Implement optimizations such that

    • Find redundant instructions
    • Then, replace it and changing all of instructions that relate to it (using DataFlow Graph)

To implement such optimizations, we need to repeat traversing CFG many times. We need to avoid such expensive costs to define DataFlow Graph.

Data Flow Graph behaves like "Cache". To avoid using old or wrong cache, interfaces of BasicBlock or Method must be carefully designed. This will be big change for the current implementation.

nomaddo commented 6 years ago

solved in #55