JuliaGPU / AMDGPU.jl

AMD GPU (ROCm) programming in Julia

@inbounds not propagating correctly #342

Open torrance opened 1 year ago

torrance commented 1 year ago

@inbounds applied to the kernel function definition has no effect.

Additionally, @inbounds does not propagate through function calls made within a kernel, for example a call to zip().

The following benchmarks from https://github.com/torrance/AMDGPU-MWE/blob/main/inbounds.jl demonstrate the performance penalty. Note that the 3rd benchmark is likely doubly penalised since the call to zip() isn't inlined.

function @inbounds => @inbounds annotated at function definition
internal @inbounds => @inbounds annotated at lines with indexing operations
using zip() => using a zip() to iterate and index into arrays
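For readers who have not opened the linked file, the three annotation placements might look roughly like the sketch below. This is a CPU-only illustration under stated assumptions: the function names and the element-wise addition are hypothetical stand-ins, not the kernels from inbounds.jl, and no AMDGPU.jl API is used.

```julia
# Hypothetical CPU-only stand-ins for the three benchmarked variants.
# The names and the workload (element-wise add) are assumptions for illustration.

# "function @inbounds": the macro wraps the whole function definition.
@inbounds function add_function_inbounds!(out, a, b)
    for i in eachindex(out)
        out[i] = a[i] + b[i]
    end
    return out
end

# "internal @inbounds": the macro sits on the line that does the indexing.
function add_internal_inbounds!(out, a, b)
    for i in eachindex(out)
        @inbounds out[i] = a[i] + b[i]
    end
    return out
end

# "using zip()": iterate the inputs through zip() and index only the output.
function add_zip!(out, a, b)
    @inbounds for (i, (x, y)) in enumerate(zip(a, b))
        out[i] = x + y
    end
    return out
end

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
r1 = add_function_inbounds!(similar(a), a, b)
r2 = add_internal_inbounds!(similar(a), a, b)
r3 = add_zip!(similar(a), a, b)
```

All three variants compute the same result; they differ only in where the bounds-check annotation is placed, which is the difference the benchmarks below measure on the GPU.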

Function @inbounds
BenchmarkTools.Trial: 18 samples with 1 evaluation.
 Range (min … max):  283.219 ms … 287.235 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     283.964 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   284.278 ms ± 874.447 μs  ┊ GC (mean ± σ):  0.10% ± 0.29%

            ▁█                                                   
  ▄▁▁▁▁▁▁▁▄▄██▁▁▁▁▄▄▁▁▄▁▁▁▄▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  283 ms           Histogram: frequency by time          287 ms <

 Memory estimate: 6.21 MiB, allocs estimate: 406760.

Internal @inbounds
BenchmarkTools.Trial: 36 samples with 1 evaluation.
 Range (min … max):  141.340 ms … 141.616 ms  ┊ GC (min … max): 1.78% … 0.00%
 Time  (median):     141.471 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   141.469 ms ±  69.181 μs  ┊ GC (mean ± σ):  0.10% ± 0.42%

           ▃      ▃▃ ▃        ▃▃ █        ▃                      
  ▇▁▁▁▁▁▁▇▇█▁▇▇▁▁▁██▇█▁▁▇▇▁▁▁▁██▇█▇▁▁▁▁▇▁▁█▇▁▇▇▁▁▁▇▁▇▁▇▁▁▁▁▇▁▁▇ ▁
  141 ms           Histogram: frequency by time          142 ms <

 Memory estimate: 3.06 MiB, allocs estimate: 200490.

Using zip()
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max):  318.848 ms … 319.049 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     318.942 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   318.950 ms ±  61.016 μs  ┊ GC (mean ± σ):  0.10% ± 0.28%

  ▁       ▁    ▁   █▁   ▁ ▁       ▁  ▁       ▁█        ▁     ▁▁  
  █▁▁▁▁▁▁▁█▁▁▁▁█▁▁▁██▁▁▁█▁█▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁█▁▁▁▁▁██ ▁
  319 ms           Histogram: frequency by time          319 ms <

jpsamaroo commented 1 year ago

Does this work on CUDA? If so, I can take a look at how they do it and try to mirror their implementation.

torrance commented 1 year ago

> Does this work on CUDA? If so, I can take a look at how they do it and try to mirror their implementation.

@jpsamaroo In fact you're right: my benchmarking shows it also fails to work with CUDA.jl. The run times rank as follows (lower is faster):

(inline @inbounds) < (function @inbounds) == (no @inbounds) < (zip with @inbounds)

Should this be an issue raised with GPUCompiler? Or...?

jpsamaroo commented 1 year ago

> Should this be an issue raised with GPUCompiler? Or...?

Yeah, that seems like the play.