IntelLabs / ParallelAccelerator.jl

The ParallelAccelerator package, part of the High Performance Scripting project at Intel Labs
BSD 2-Clause "Simplified" License
294 stars 32 forks source link

Copy propagation improvement #131

Closed ehsantn closed 7 years ago

ehsantn commented 7 years ago

We need to improve copy/constant propagation. The code below is a good example. Let's improve it as much as possible.

using ParallelAccelerator
@acc function kernelscore_test(n)
    X = rand(n)
    points = [-1.0, 2.0, 5.0]
    N = size(points,1)
    b = 0.5 
    exps::Float64 = 0.0 
    @par exps(+) for i in 1:length(X)
        d = -(X[i]-points).^2./(2*b^2)
        exps += minimum(d)-log(b*N)+log(sum(exp(d-minimum(d))))
    end 
    return exps
end
println(kernelscore_test(1024))

For example, constants b and N are not propagated. The code d = -(X[i]-points).^2./(2*b^2) becomes two big loops but it could be one simple loop:

for ( parfor_index_1_17 = 1; parfor_index_1_17 <= (int64_t)parallel_ir_save_array_len_1_17; parfor_index_1_17 += 1) 
{ 
                    ; 
                    parallel_ir_array_temp__6_18_1 = points.ARRAYELEM(parfor_index_1_17); 
                    SSAValue34 = (SSAValue23pp1) - (parallel_ir_array_temp__6_18_1); 
                    parallel_ir_array_temp__23_20_1 = SSAValue34; 
                    parallel_ir_array_temp_SSAValue24_22_1 = parallel_ir_array_temp__23_20_1; 
                    SSAValue35 = (int32_t)(2); 
                    pow(parallel_ir_array_temp_SSAValue24_22_1, SSAValue35); 
                    SSAValue36 = pow(parallel_ir_array_temp_SSAValue24_22_1, SSAValue35); 
                    parallel_ir_array_temp_SSAValue24_24_2 = SSAValue36; 
                    parallel_ir_array_temp_SSAValue25_27_1 = parallel_ir_array_temp_SSAValue24_24_2; 
                    SSAValue37 = -(parallel_ir_array_temp_SSAValue25_27_1); 
                    parallel_ir_array_temp_SSAValue25_29_2 = SSAValue37; 
                    parallel_ir_new_array_name_17_1.ARRAYELEM(parfor_index_1_17) = parallel_ir_array_temp_SSAValue25_29_2; 
                } 
                ; 
                SSAValue20 = parallel_ir_new_array_name_17_1; 
                SSAValue29 = (int32_t)(2); 
                pow(b, SSAValue29); 
                SSAValue22 = pow(b, SSAValue29); 
                SSAValue30 = (double)2; 
                SSAValue32pp2 = (SSAValue30) * (SSAValue22); 
                parallel_ir_save_array_len_1_31 = SSAValue20.ARRAYSIZE(1); 
                parallel_ir_reduction_output_35 = DBL_MAX; 
                for ( parfor_index_1_31 = 1; parfor_index_1_31 <= (int64_t)parallel_ir_save_array_len_1_31; parfor_index_1_31 += 1) 
                { 
                    ; 
                    parallel_ir_array_temp_SSAValue4_32_1 = SSAValue20.ARRAYELEM(parfor_index_1_31); 
                    SSAValue38 = (parallel_ir_array_temp_SSAValue4_32_1) / (SSAValue32pp2); 
                    parallel_ir_array_temp_SSAValue4_34_2 = SSAValue38; 
                    SSAValue20.ARRAYELEM(parfor_index_1_31) = parallel_ir_array_temp_SSAValue4_34_2; 
                    parallel_ir_array_temp__4_37_1 = parallel_ir_array_temp_SSAValue4_34_2; 
                    SSAValue39 = parallel_ir_array_temp__4_37_1; 
                    SSAValue40 = parallel_ir_reduction_output_35; 
                    SSAValue41 = (SSAValue40) < (0); 
                    SSAValue42 = (SSAValue39) < (0); 
                    SSAValue43 = !(SSAValue41); 
                    (SSAValue42) & (SSAValue43); 
                    SSAValue44 = (parallel_ir_array_temp__4_37_1) < (parallel_ir_reduction_output_35); 
                    SSAValue45 = (SSAValue42) & (SSAValue43); 
                    (SSAValue44) | (SSAValue45); 
                    SSAValue46 = (parallel_ir_array_temp__4_37_1) != (parallel_ir_array_temp__4_37_1); 
                    SSAValue47 = (parallel_ir_reduction_output_35) != (parallel_ir_reduction_output_35); 
                    SSAValue48 = (SSAValue44) | (SSAValue45); 
                    SSAValue49 = (SSAValue46) ? (parallel_ir_reduction_output_35) : (parallel_ir_array_temp__4_37_1); 
                    SSAValue50 = (SSAValue47) ? (parallel_ir_array_temp__4_37_1) : (parallel_ir_reduction_output_35); 
                    SSAValue51 = (SSAValue48) ? (SSAValue49) : (SSAValue50); 
                    parallel_ir_reduction_output_35 = SSAValue51; 
                } 
                ; 
d = SSAValue20;
SSAValue21 = parallel_ir_reduction_output_35;
                SSAValue31 = (double)N;
                SSAValue32 = (b) * (SSAValue31);
                SSAValue23 = _ParallelAcceleratorAPI_log(SSAValue32);
                SSAValue24 = (SSAValue21) - (SSAValue23);
                parallel_ir_save_array_len_1_40 = d.ARRAYSIZE(1);
                parallel_ir_reduction_output_40 = DBL_MAX;
DrTodd13 commented 7 years ago

I think I have this down to the minimum number of parfors (5).

  1. rand(n)
  2. @ par
    1. d = -(X[i]-points).^2./(2*b^2) 4 & 5. The two calls to minimum(d).

The latter two raise the possibility of doing CSE to have the minimum(d) done only once.

Another issue is that 2*b^2 can be elevated not just to the beginning of @ par but outside of the @ par. The current system cannot do this. So, this issue and the copy propagation are sort of duels....one going into the @ par and the other coming out of it.

ehsantn commented 7 years ago

Since 2*b^2 is constant, ideally it should be replaced with its value.

ehsantn commented 7 years ago

I think this issue is resolved now.