FluxML / NNlib.jl

Neural Network primitives with multiple backends

always put the index on GPU if the source array is there when using `gather` #415

[Open] YichengDWu opened this issue 2 years ago

YichengDWu commented 2 years ago
using CUDA, NNlib
using Flux: gpu  # this imports NNlibCUDA
using BenchmarkTools
CUDA.allowscalar(false)

a = CUDA.rand(200, 3000, 64)
idx = rand(1:64, 500)
idx_gpu = idx |> gpu

@benchmark CUDA.@sync NNlib.gather(a, idx)
@benchmark CUDA.@sync NNlib.gather(a, idx_gpu)

BenchmarkTools.Trial: 129 samples with 1 evaluation.
 Range (min … max):  12.639 ms … 49.596 ms  ┊ GC (min … max): 0.00% … 1.82%
 Time  (median):     37.477 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.929 ms ±  4.242 ms  ┊ GC (mean ± σ):  0.43% ± 0.75%

                                         ▄█
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁███▅▅▅▃▂▁▁▁▁▂▄▄▄▁▃▃▂▃ ▂
  12.6 ms         Histogram: frequency by time        48.8 ms <

 Memory estimate: 1.36 KiB, allocs estimate: 24.

BenchmarkTools.Trial: 131 samples with 1 evaluation.
 Range (min … max):  14.508 ms … 48.818 ms  ┊ GC (min … max): 0.00% … 1.83%
 Time  (median):     37.054 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.231 ms ±  3.701 ms  ┊ GC (mean ± σ):  0.42% ± 0.77%

                                        ▃█▂
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███▃▃▂▁▁▂▁▃▃▃▄▄▂▃▁▁▁▂ ▂
  14.5 ms         Histogram: frequency by time        47.9 ms <

 Memory estimate: 1.36 KiB, allocs estimate: 24.

As far as I can tell, there's no major difference: with only 500 indices, the gather itself (each index copies a 200×3000 slice) dominates, so where the index lives barely matters. And moving the index to the GPU could cause problems, as mentioned in #411.

YichengDWu commented 2 years ago

This, however, does make a difference:

using CUDA, NNlib
using Flux: gpu
using BenchmarkTools
CUDA.allowscalar(false)

a = CUDA.rand(200, 30, 64)
idx = rand(1:64, 100_000)
idx_gpu = idx |> gpu

@benchmark CUDA.@sync NNlib.gather(a, idx)
@benchmark CUDA.@sync NNlib.gather(a, idx_gpu)

But I'm not sure how much of the speedup comes from the lack of bounds checking in https://github.com/FluxML/NNlibCUDA.jl/blob/master/src/gather.jl#L52
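
One way to isolate that would be to time the same indexed copy with and without `@inbounds`. A minimal 1-D sketch (illustrative only, not the actual NNlibCUDA kernel; `gather_rows!` and the launch configuration are made up for the comparison):

using CUDA

# Toggle bounds checking at compile time via the Val parameter.
function gather_rows!(dst, src, idx, ::Val{unchecked}) where {unchecked}
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(idx)
        if unchecked
            @inbounds dst[i] = src[idx[i]]  # no bounds checks
        else
            dst[i] = src[idx[i]]            # bounds-checked access
        end
    end
    return nothing
end

src = CUDA.rand(64)
idx = CuArray(rand(1:64, 100_000))
dst = CUDA.zeros(length(idx))
@cuda threads=256 blocks=cld(length(idx), 256) gather_rows!(dst, src, idx, Val(true))

Benchmarking the two Val variants side by side would show how much of the gap is really due to the bounds checks.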

Once we have bounds checking there, we should automatically move large indices to the GPU whenever the source array is on the GPU, or at least suggest that users do so.
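
For example, something like this (a hypothetical sketch; `gather_adapt` and the threshold value are illustrative, not actual NNlib API):

using CUDA, NNlib

function gather_adapt(src::CuArray, idx::AbstractArray; threshold::Int=10_000)
    # Move a large CPU index to the GPU before gathering; below the
    # threshold the host-to-device copy may cost more than it saves.
    if !(idx isa CuArray) && length(idx) >= threshold
        idx = CuArray(idx)
    end
    return NNlib.gather(src, idx)
end

The right cutoff would have to come from benchmarks like the ones above.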

YichengDWu commented 2 years ago

To conclude: we should always put a (large) index on the GPU when `src` is a CUDA array.