Extremely slow reduction

maleadt commented 2 years ago

On a 1024x1024 Float32 matrix:

julia> @benchmark sum($a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  84.734 μs … 228.946 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     85.332 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   87.917 μs ±   7.545 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇▃▂▄▁▁                                                      ▁
  ████████▇██▇▇█▇▆█████▇██████▇▆▇▆▆▅▆▆▆▆▆▆▆▆▅▅▅▆▆▄▇█▆▅▅▅▆▄▄▃▅▄ █
  84.7 μs       Histogram: log(frequency) by time       120 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum($d_a)
BenchmarkTools.Trial: 618 samples with 1 evaluation.
 Range (min … max):  6.966 ms …   9.740 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.047 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.079 ms ± 602.087 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▁▁▄▃▄  ▁ ▁▇ ▅▄█▃▄            ▂ ▁ ▃▁▇▁ ▁ ▁▁  ▃           
  ▂▂▃▃▄█████▇▇█▇██▇█████▄▇▆▆▃█▅▇▇▆▆███████████▅█████▃▄▃▆▂▅▄▂▃ ▅
  6.97 ms         Histogram: frequency by time        9.28 ms <

 Memory estimate: 27.41 KiB, allocs estimate: 509.

It scales, so this is probably the kernel being bad:

julia> d_a = oneArray(rand(Float32, 4096, 4096));

julia> a = rand(Float32, 4096, 4096);

julia> @benchmark sum($a)
BenchmarkTools.Trial: 1682 samples with 1 evaluation.
 Range (min … max):  2.918 ms …  3.185 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.964 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.967 ms ± 25.760 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

            ▅  █      ▆  ▃                                    
  ▂▁▁▃▂▂▇▅▃▇█▃▃█▃▂██▃██▄▄█▇▃▄▇▂▃▇▃▂▅▅▂▃▄▂▂▃▂▂▃▃▂▂▃▂▁▃▂▂▂▂▁▁▂ ▃
  2.92 ms        Histogram: frequency by time        3.05 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum($d_a)
BenchmarkTools.Trial: 45 samples with 1 evaluation.
 Range (min … max):  112.776 ms … 113.728 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     113.151 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   113.186 ms ± 218.961 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁       ▁       ▁ ▄   ▄█▁ ▄▁       ▁▁    ▁                     
  █▁▁▆▆▁▁▁█▆▁▁▁▁▁▁█▆█▆▁▆███▁██▁▁▁▆▆▆▆██▁▁▁▆█▆▁▁▁▁▆▁▁▆▁▁▁▁▁▆▁▁▁▆ ▁
  113 ms           Histogram: frequency by time          114 ms <

 Memory estimate: 28.75 KiB, allocs estimate: 516.

maleadt commented 2 years ago

Definitely the kernel:

maleadt commented 2 years ago

Occupancy of the main kernel is good at 95%, EUs are 93% active, so I'm wondering if I'm doing something fundamentally wrong here.

freemin7 commented 2 years ago

So something improved since then:

julia> @benchmark sum(da)
BenchmarkTools.Trial: 106 samples with 1 evaluation.
 Range (min … max):  47.019 ms …  47.963 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     47.258 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.297 ms ± 166.846 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁ ▁  █▄▁▃ ▃▁▁ ▃  ▃▄       ▃                             
  ▆▁▁▄▄▁█▄█▆▆████▇███▁█▇▇██▄▆▇▄▁▁▆█▄▄▄▁▆▁▄▄▁▄▁▁▁▁▁▁▄▁▁▆▄▁▁▁▁▁▄ ▄
  47 ms           Histogram: frequency by time         47.8 ms <

 Memory estimate: 29.81 KiB, allocs estimate: 515.

On the same hardware: ZeDevice(GPU, vendor 0x8086, device 0x3e96): Intel(R) UHD Graphics P630 [0x3e96]

Some uneducated guesses what might be going wrong:

https://github.com/JuliaGPU/oneAPI.jl/blob/fa26e213e8d7a7f4fad4178879aa6af12dae99c2/src/mapreduce.jl#L61-L69 I interpret that as we compute the neutral element in every unit of compute. There is indeed a fadd %x %x in the kernel related to branch.
Every length(Rother) seems to involve two memory accesses and a multiplication. Julia doesn't seem to exploit the fact that length(Rother) is constant in map reduces. Calculation length(Rother) et al. on the CPU and passing it in might help.
I am probably wrong but i feel like there is a barrier missing here

maleadt commented 2 years ago

Those are valid concerns, but I doubt that they are responsible for the huge slowdown. The reduce implementation is taken from CUDA.jl, where it performs well.

freemin7 commented 2 years ago

I will poke at it a bit. There are differences between CUDA and SYCL though https://sycl.tech/assets/files/Michel_Migdal_Codeplay_Porting_Tips_CDUA_To_SYCL.pdf some off them sound relevant.

maleadt commented 2 years ago

That looks like an interesting document. I haven't had the time yet to optimize oneAPI.jl, only focusing on features right now, so it would be great if you would have the time to take a look :-) Let me know if there's anything I can help with.

maleadt commented 1 year ago

FWIW, on an A770 (vs a 5950X):

julia> @benchmark sum($a)
BenchmarkTools.Trial: 2677 samples with 1 evaluation.
 Range (min … max):  1.794 ms …  2.207 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.878 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.862 ms ± 37.677 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               ▃▂█▇▄▃
  ▃▄▅▆▆▇▇▆▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▃▇███████▆▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  1.79 ms        Histogram: frequency by time        1.95 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum($d_a)
BenchmarkTools.Trial: 1105 samples with 1 evaluation.
 Range (min … max):  1.718 ms … 11.036 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.735 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.527 ms ±  2.293 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃▂               ▁ ▁
  ███▃▁▁▂▁▁▁▁▁▁▁▁▁▁▁█▆█▇█▇▃▅▄▃▃▃▂▂▂▂▄▄▃▄▄▄▃▂▂▁▂▁▁▁▂▂▁▁▃▃▃▄▃▂ ▃
  1.72 ms        Histogram: frequency by time        10.2 ms <

 Memory estimate: 31.45 KiB, allocs estimate: 588.

So still way too slow, but at least not outperformed by the CPU...

JuliaGPU / oneAPI.jl

Extremely slow reduction #110