HSAFoundation / HSAIL-HLC-Stable

LLVM-based high-level-compiler (HLC) that generates HSAIL. The Stable version includes optimizations, atomics, and is provided in binary form. See also the HSAIL-HLC-Development version for other options.

Compiler generates incorrect code when doing loop unrolling. #6

Open atgutier opened 9 years ago

atgutier commented 9 years ago

I have encountered a problem with the code that is generated when a loop is unrolled. The compiler pre-computes the condition for each iteration of the loop and stores it in the spill stack; then, in each unrolled iteration, it loads that value and performs a conditional branch (cbr) on it. For some reason, in the code for the second iteration of the loop, and only that iteration, the compiler inverts the condition variable with a not operation. As a result, the data is not stored to memory when the store is supposed to execute, and a buffer overflow occurs when the store is not supposed to execute.

Unrolling the loop by hand works: the output is correct and matches the output of a previous version of HSAIL-HLC-Stable.
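For reference, the hand-unrolled form mentioned above can be sketched as follows. This is a host-side C model of one work-item's loop body, not the original kernel. The function names, the 32-bit unsigned element type, and the `sum0` parameter (standing in for `sum[0]`) are assumptions made for illustration; `tid`, `lid`, `length`, `array`, and `array2` mirror the CL loop in this report.

```c
#include <assert.h>
#include <stddef.h>

/* One guarded store, as in the body of the loop in this report.
   Assumption: 32-bit unsigned elements; sum0 stands in for sum[0]. */
static void guarded_store(unsigned *array, const unsigned *array2,
                          int tid, int lid, int i, int length, unsigned sum0)
{
    if (16 * tid + i < length)
        array[16 * tid + i] = sum0 + array2[16 * lid + i];
}

/* Rolled form: the loop the compiler unrolls on its own. */
void body_rolled(unsigned *array, const unsigned *array2,
                 int tid, int lid, int length, unsigned sum0)
{
    assert(array != NULL && array2 != NULL);
    for (int i = 0; i < 16; i++)
        guarded_store(array, array2, tid, lid, i, length, sum0);
}

/* Hand-unrolled form: sixteen guarded stores, each with a constant i. */
#define STORE(i) guarded_store(array, array2, tid, lid, (i), length, sum0)
void body_unrolled(unsigned *array, const unsigned *array2,
                   int tid, int lid, int length, unsigned sum0)
{
    assert(array != NULL && array2 != NULL);
    STORE(0);  STORE(1);  STORE(2);  STORE(3);
    STORE(4);  STORE(5);  STORE(6);  STORE(7);
    STORE(8);  STORE(9);  STORE(10); STORE(11);
    STORE(12); STORE(13); STORE(14); STORE(15);
}
#undef STORE
```

Both forms should leave identical contents in `array` for any `tid`, `lid`, and `length`; a difference between them would point at the unrolling rather than at the source loop.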

Here is an example of the loop and how it is unrolled.

CL code:

for (int i = 0; i < 16; i++) {
    if (16 * tid + i < length)
        array[16 * tid + i] = sum[0] + array2[16 * lid + i];
}

HSAIL disassembled code:

//iteration 1
@BB7_49: // %if.end29
    barrier;
    ld_spill_align(4)_u32 $s1, [%__spillStack][24];
    cvt_b1_u32 $c0, $s1;
    cbr_b1 $c0, @BB7_51;
// BB#50: // %if.then39
    cvt_s64_s32 $d3, $s21;
    shl_u64 $d3, $d3, 2;
    add_u64 $d19, $d1, $d3;
    cvt_u32_u64 $s1, $d2;
    ld_group_align(4)_u32 $s1, [$s1];
    ld_group_align(4)_width(WAVESIZE)_u32 $s3, [%__hsa_replaced_Kernel_sum_0_0];
    add_u32 $s1, $s1, $s3;
    st_global_align(4)_u32 $s1, [$d19];

//iteration 2
@BB7_51: // %for.inc50
    ld_spill_align(4)_u32 $s1, [%__spillStack][28];
    cvt_b1_u32 $c0, $s1;
    not_b1 $c0, $c0; //incorrect inversion
    cbr_b1 $c0, @BB7_53;
// BB#52: // %if.then39.1
    cvt_s64_s32 $d3, $s22;
    shl_u64 $d3, $d3, 2;
    add_u64 $d19, $d1, $d3;
    cvt_u32_u64 $s1, $d4;
    ld_group_align(4)_u32 $s1, [$s1];
    ld_group_align(4)_width(WAVESIZE)_u32 $s3, [%__hsa_replaced_Kernel_sum_0_0];
    add_u32 $s1, $s1, $s3;
    st_global_align(4)_u32 $s1, [$d19];

//iteration 3
@BB7_53: // %for.inc50.1
    ld_spill_align(4)_u32 $s1, [%__spillStack][32];
    cvt_b1_u32 $c0, $s1;
    cbr_b1 $c0, @BB7_55;
// BB#54: // %if.then39.2
    cvt_s64_s32 $d3, $s23;
    shl_u64 $d3, $d3, 2;
    add_u64 $d19, $d1, $d3;
    cvt_u32_u64 $s1, $d5;
    ld_group_align(4)_u32 $s1, [$s1];
    ld_group_align(4)_width(WAVESIZE)_u32 $s3, [%__hsa_replaced_Kernel_sum_0_0];
    add_u32 $s1, $s1, $s3;
    st_global_align(4)_u32 $s1, [$d19];

etc...
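
To make the two symptoms concrete, here is a small host-side C model of the control flow in the disassembly above. It is only a sketch under stated assumptions: the spilled value is treated as a "skip the store" predicate (inferred from `cbr_b1` jumping over the store block), "iteration 2" is taken to mean `i == 1`, and the `Outcome` struct, `run_unrolled_model`, and its `inverted_iter` parameter are hypothetical names that do not come from the compiler. The model only counts what would happen and never performs an out-of-range write itself.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int stored;        /* stores that ran and were in range           */
    int missed;        /* in-range stores that were wrongly skipped   */
    int out_of_range;  /* stores that ran past length (the overflow)  */
} Outcome;

/* inverted_iter: index of the iteration with the stray not_b1, or -1. */
Outcome run_unrolled_model(int tid, int length, int inverted_iter)
{
    bool skip[16];
    Outcome out = {0, 0, 0};

    assert(inverted_iter >= -1 && inverted_iter < 16);

    /* Pre-compute phase: one predicate per iteration (the spill slots). */
    for (int i = 0; i < 16; i++)
        skip[i] = !(16 * tid + i < length);

    /* Unrolled phase: load the predicate, optionally invert, branch. */
    for (int i = 0; i < 16; i++) {
        bool in_range = 16 * tid + i < length;
        bool c0 = skip[i];             /* ld_spill + cvt_b1_u32 */
        if (i == inverted_iter)
            c0 = !c0;                  /* the stray not_b1      */
        if (c0) {                      /* cbr_b1 taken: skip    */
            if (in_range)
                out.missed++;
            continue;
        }
        if (in_range)
            out.stored++;
        else
            out.out_of_range++;
    }
    return out;
}
```

With `inverted_iter` set to 1, a work-item whose indices are all in range loses exactly one store, and a work-item whose indices are all out of range performs exactly one store past `length`, which matches the two failures described in this report.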