JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Slow 2D sum #1330

Open Cvikli opened 2 years ago

Cvikli commented 2 years ago

Coming from this observation in issue #1323.

Given this code:

using BenchmarkTools
using CUDA
function mysum(X, Y, n)
    # One thread per row: thread I accumulates row I of Y into X[I]
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > n && return
    cY = CUDA.Const(Y)  # mark Y as read-only for this kernel
    @inbounds for j in 1:n
        X[I] += cY[I,j]
    end
    return
end

# MYSUM PERF. TEST.
nth=512
N = 500; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 1000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 2000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 4000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 6000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 8000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 10000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 12000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 

# BUILTIN PERF. TEST.
N = 500; A = CUDA.randn(Float32, N, N); (@btime CUDA.@sync sum($A, dims=2))
N = 1000; A = CUDA.randn(Float32, N, N); (@btime CUDA.@sync sum($A, dims=2))
N = 2000; A = CUDA.randn(Float32, N, N); (@btime CUDA.@sync sum($A, dims=2))
N = 4000; A = CUDA.randn(Float32, N, N); (@btime CUDA.@sync sum($A, dims=2))
N = 6000; A = CUDA.randn(Float32, N, N); (@btime CUDA.@sync sum($A, dims=2))
N = 8000; A = CUDA.randn(Float32, N, N); (@btime CUDA.@sync sum($A, dims=2))
N = 10000; A = CUDA.randn(Float32, N, N); (@btime CUDA.@sync sum($A, dims=2))
N = 12000; A = CUDA.randn(Float32, N, N); (@btime CUDA.@sync sum($A, dims=2))

This gives us the following results:

# mysum:
  27.702 μs (44 allocations: 1.84 KiB)
  80.011 μs (34 allocations: 1.44 KiB)
  155.594 μs (168 allocations: 5.62 KiB)
  323.131 μs (40 allocations: 1.62 KiB)
  576.200 μs (314 allocations: 10.19 KiB)
  885.445 μs (630 allocations: 20.06 KiB)
  1.408 ms (988 allocations: 31.25 KiB)
  1.965 ms (1230 allocations: 38.81 KiB)

# builtin sum(dims=2):
  61.396 μs (134 allocations: 4.84 KiB)
  144.424 μs (266 allocations: 8.95 KiB)
  333.410 μs (321 allocations: 10.67 KiB)
  968.642 μs (1165 allocations: 37.05 KiB)
  2.979 ms (1567 allocations: 49.61 KiB)
  5.656 ms (2234 allocations: 70.45 KiB)
  8.874 ms (10531 allocations: 329.73 KiB)
  13.106 ms (19971 allocations: 624.73 KiB)

That suggests a possible 200-1000% speedup.

Why is "mysum" faster? Is it the CUDA.Const? The specialization on dims=2? Or are we missing something?
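
One way to isolate the CUDA.Const question is to benchmark a variant without it (a sketch; mysum_noconst is just an illustrative name):

# same kernel as mysum, but reading Y directly instead of through CUDA.Const
function mysum_noconst(X, Y, n)
    I = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    I > n && return
    @inbounds for j in 1:n
        X[I] += Y[I,j]
    end
    return
end

If this benchmarks the same as mysum, the Const annotation is not the explanation.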

Cvikli commented 2 years ago

Or... the preallocation! Damn, I am comparing different operations. :(

I will be back with the right measurements.

Cvikli commented 2 years ago

Using the in-place sum!, so the builtin no longer pays for allocating its output either, shows slightly better speeds:

N = 500; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 1000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 2000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 4000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 6000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 8000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 10000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 12000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))

New results:

  41.969 μs (73 allocations: 2.47 KiB)
  95.652 μs (152 allocations: 4.94 KiB)
  246.136 μs (207 allocations: 6.66 KiB)
  921.868 μs (1572 allocations: 49.31 KiB)
  2.781 ms (2101 allocations: 65.84 KiB)
  5.428 ms (5666 allocations: 177.25 KiB)
  8.370 ms (10369 allocations: 324.22 KiB)
 12.587 ms (16901 allocations: 528.34 KiB)

But there is actually still a notable difference.

Cvikli commented 2 years ago

Don't want to mix things, but running the builtin sum(..., dims=2) and mean(..., dims=2) gives nearly exactly the same speed in every test case. Does this help you narrow down what the problem might be?
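
(A sketch of that comparison, reusing one N from above; mean needs a Statistics import:)

using Statistics

N = 4000; A = CUDA.randn(Float32, N, N)
@btime CUDA.@sync sum($A, dims=2)
@btime CUDA.@sync mean($A, dims=2)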

maleadt commented 2 years ago

running the builtin sum(..., dims=2) and mean(..., dims=2) gives nearly exactly the same speed in every test case

That's expected, right? mean is just sum divided by length.
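
For example (a quick sketch of that identity):

using CUDA, Statistics

A = CUDA.randn(Float32, 100, 100)
mean(A, dims=2) ≈ sum(A, dims=2) ./ size(A, 2)  # true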

Cvikli commented 2 years ago

Yes, definitely. I just wanted to make sure that sum itself is the important case, since multiple aggregation algorithms depend on it.

maleadt commented 2 years ago

I wonder if it's just a launch configuration issue again, because generally CUDA.jl uses the max threads returned by the occupancy API, likely 1024 threads. Could you compare against that?
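
For reference, the occupancy-driven launch idiom looks roughly like this (a sketch, reusing B, A and N from the benchmarks above):

kernel = @cuda launch=false mysum(B, A, N)
config = launch_configuration(kernel.fun)  # query the occupancy API
threads = min(N, config.threads)           # typically 1024 on recent GPUs
blocks = cld(N, threads)                   # only N threads do useful work here
kernel(B, A, N; threads, blocks)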

Cvikli commented 2 years ago

You mean test with 64/256/512/1024 threads and compare each test case?

maleadt commented 2 years ago

Yeah. nth=1024 would compare with CUDA.jl's launch configuration.

Cvikli commented 2 years ago

I got the same results. The code I ran:

nth=1024
N = 500; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 1000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 2000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 4000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 6000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 8000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 10000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 
N = 12000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth)  mysum($B,$A,$N)) 

Tuebel commented 2 years ago

Hi,

my idea would be to turn the summation into a matrix-vector multiplication where the vector consists of ones. I have implemented and benchmarked a few different combinations:

  1. Reducing only one dimension via multiplication and then using sum
  2. Reducing both dimensions via multiplication
  3. Testing 1 again with a preallocated vector

Here is the code:

using CUDA
M_rand = CUDA.rand(Float32, 1000, 1001)
M_cpu = Array(M_rand)
v_ones = CUDA.ones(Float32, 1, size(M_rand)[1])
M_ones = CUDA.ones(Float32, size(M_rand))

function dot_sum(M)
    # 1. reduce one dimension via multiplication, then `sum` the rest
    v = CUDA.ones(Float32, 1, size(M)[1])
    sum(v * M)
end
function dot_sum_full(M)
    # 2. reduce both dimensions via multiplication
    v = CUDA.ones(Float32, 1, size(M)[1])
    reduced = v * M
    v2 = CUDA.ones(Float32, length(reduced))
    # Avoid scalar indexing warning
    reduce(+, reduced * v2)
end
function dot_sum_prealloc(v, M)
    # 3. like 1., but with a preallocated ones-vector
    sum(v * M)
end

using LinearAlgebra
@cuassert sum(M_rand) ≈ dot_sum(M_rand) ≈ dot_sum_full(M_rand) ≈ dot_sum_prealloc(v_ones, M_rand) ≈ dot(M_ones, M_rand)

using BenchmarkTools
@benchmark CUDA.@sync dot_sum(M_rand)
@benchmark CUDA.@sync dot_sum_full(M_rand)
@benchmark CUDA.@sync dot_sum_prealloc(v_ones, M_rand)

@benchmark CUDA.@sync sum(M_rand)
@benchmark CUDA.@sync dot(M_ones, M_rand)
@benchmark sum(M_cpu)

And the results:

# @benchmark CUDA.@sync dot_sum(M_rand)
BenchmarkTools.Trial: 9906 samples with 1 evaluation.
 Range (min … max):  175.332 μs … 262.126 ms  ┊ GC (min … max): 0.00% … 9.63%
 Time  (median):     203.573 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   496.464 μs ±   2.749 ms  ┊ GC (mean ± σ):  0.51% ± 0.10%

  █▅▄▂▁                      ▁                      ▂▁          ▁
  █████▇▆▄▅▄▃▄▆▆▆█▇▇█▇▆▆▆▆▇▇███▇▅▄▄▄▁▁▄▄▅▄▃▃▅▃▄▆▇█▆▆████▇▆▅▃▄▆▆ █
  175 μs        Histogram: log(frequency) by time       3.79 ms <

 Memory estimate: 5.22 KiB, allocs estimate: 112.

# @benchmark CUDA.@sync dot_sum_full(M_rand)
BenchmarkTools.Trial: 9739 samples with 1 evaluation.
 Range (min … max):  178.480 μs … 8.318 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     199.840 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   506.466 μs ± 1.029 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▂▁                           ▁▁                           ▁
  ████▇▆▅▆▇█▇▇▇▆▇███▇▆▅▅▅▅▅▄▄▅▆▅▇███▇▆▆▅▅▅▅▄▅▅▅▄▅▄▆▅▆▆▆▅▄▁▅▆▅ █
  178 μs       Histogram: log(frequency) by time       6.1 ms <

 Memory estimate: 3.62 KiB, allocs estimate: 95.

# @benchmark CUDA.@sync dot_sum_prealloc(v_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  167.418 μs … 157.410 ms  ┊ GC (min … max): 0.00% … 11.92%
 Time  (median):     182.596 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   441.800 μs ±   1.753 ms  ┊ GC (mean ± σ):  0.42% ±  0.12%

  █▄▂                                          ▁▁               ▁
  ████▆▅▅▄▄▁▄▁▅▄▇▇▆▇▇▇▆▆▆▆▆▇▇██▇▇▅▅▄▅▅▅▆▅▄▆▇▇▇▆██▇█▇▆▄▄▄▃▁▁▃▄▃▃ █
  167 μs        Histogram: log(frequency) by time       4.21 ms <

 Memory estimate: 5.06 KiB, allocs estimate: 106.

# @benchmark CUDA.@sync sum(M_rand)
BenchmarkTools.Trial: 3969 samples with 1 evaluation.
 Range (min … max):  492.841 μs … 12.605 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     713.582 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.245 ms ±  1.029 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇▇█▇▆▄▁      ▁▁▁▂▂▁        ▁      ▁▁  ▁▂▃▃▂▂▁                ▁
  ███████▇█▆▄▆████████▇▆▇██████▇▆▇▆▇███████████▇▃▃▃▅▇█▆▅▄▃▆█▅▇ █
  493 μs        Histogram: log(frequency) by time      4.24 ms <

 Memory estimate: 4.59 KiB, allocs estimate: 82.

# @benchmark CUDA.@sync dot(M_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  248.969 μs …   4.300 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     262.834 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   473.993 μs ± 704.821 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▁▂           ▁ ▁                              ▁          ▂▁ ▁
  ████▄█▇▃▄▁▄▃▅▃▅█▇█▇▅▅▅▅▅▅▆▅▄▆▆▁▃▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▅█▇▄▅▄▃▅▄▅▅▆██ █
  249 μs        Histogram: log(frequency) by time       3.24 ms <

 Memory estimate: 32 bytes, allocs estimate: 2.

# @benchmark sum(M_cpu)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  104.205 μs …  1.220 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     198.371 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   229.274 μs ± 94.172 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▂█▇▇█▇▆▄▃▃▁                                                
  ▁▁▂█████████████▇▇▇▆▆▆▆▅▆▅▄▅▅▄▄▄▄▄▃▃▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁ ▄
  104 μs          Histogram: frequency by time          539 μs <

 Memory estimate: 16 bytes, allocs estimate: 1.

So I'm getting CPU-like speeds with these implementations, which in my eyes is pretty good for a reduce operation on the GPU. Compared to the existing sum, they are ~2.5x faster in mean time, but with a higher standard deviation.
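
The same trick maps directly onto the dims=2 row sum from the original post: the row sums of A are just A * ones(N). A sketch with an in-place GEMV (variable names here are illustrative):

using CUDA, LinearAlgebra, BenchmarkTools

N = 4000
A = CUDA.randn(Float32, N, N)
ones_vec = CUDA.ones(Float32, N)
B = CUDA.zeros(Float32, N)
# row sums as a matrix-vector product; mul! dispatches to CUBLAS
@btime CUDA.@sync mul!($B, $A, $ones_vec)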

ZongyuLi-umich commented 2 years ago

Hi,

I found that the function sum! runs very slowly on the GPU and allocates more memory than expected. A minimal working example:

using CUDA
using BenchmarkTools

M_rand = CUDA.rand(Float32, 4, 4)
M_cpu = Array(M_rand)
v_ones = CUDA.ones(Float32, 1, 4)
v_cpu = Array(v_ones)
@benchmark CUDA.@sync sum!(v_ones, M_rand)
@benchmark sum!(v_cpu, M_cpu)
# @benchmark CUDA.@sync sum!(v_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.715 μs … 104.607 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.520 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.997 μs ±   2.174 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▄▇██▇▆█▅▄▄▃▃▂▁▁                                            
  ▂▃▇████████████████▇▆▆▆▅▅▅▄▃▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▄
  13.7 μs         Histogram: frequency by time         22.7 μs <

 Memory estimate: 1.66 KiB, allocs estimate: 32.

# @benchmark sum!(v_cpu, M_cpu)
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  34.386 ns … 925.473 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.611 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.039 ns ±   9.019 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▇█▅                     ▃▄▁                                 ▂
  █████▆▅▅▁▁▁▁▁▆▃▁▁▄▆▇▅▅▇▅▅███▆█▄▆▄▃▄▃▁▃▅█▇▁▁▁▃▁▄▃▃▄▄▃▄▅▆▃▃▄▄▅ █
  34.4 ns       Histogram: log(frequency) by time      40.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

So, doing sum! on the GPU is 400x slower than on the CPU, and allocates 1.66 KiB of memory for a 4x4 matrix. Any comments on this?

I am running Julia 1.7.1 on a Linux server with CPU Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz and GPU Quadro RTX 5000.

maleadt commented 2 years ago

Any comments on this?

This isn't what a GPU is for. The launch overhead alone is around 10 μs, so this is perfectly expected. You shouldn't be using a GPU for small array operations like the 4x4 here.
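
That overhead is easy to see by timing an empty kernel (a sketch; any time measured here is pure launch-plus-sync cost):

using CUDA, BenchmarkTools

noop() = nothing  # kernel body does nothing
@btime CUDA.@sync @cuda noop()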

ZongyuLi-umich commented 2 years ago

I see. I did more tests with larger array sizes, and the GPU is faster once the size goes beyond 512 x 512. Thanks!

# Matrix size: 128 x 128
# @benchmark CUDA.@sync sum!(v_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.448 μs … 153.065 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.960 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.347 μs ±   2.057 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▄▆█▇▇▇▆▅▄▃▁▁                                              
  ▁▃▄███████████████▇▆▆▅▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  14.4 μs         Histogram: frequency by time         22.5 μs <

 Memory estimate: 1.66 KiB, allocs estimate: 32.

# @benchmark sum!(v_cpu, M_cpu)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.481 μs …  12.081 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.605 μs ± 498.147 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █    ▄                                                    ▁ ▁
  █▄▃▁▁██▅█▆▄▆▆▅▄▄▅▃▃▄▄▃▃▁▃▁▁▃▄▄▄▄▄▄▄▄▁▁▁▁▁▁▄▄▄▅▄▃▃▁▃▁▁▁▁▅▄▄█ █
  4.48 μs      Histogram: log(frequency) by time      7.73 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# Matrix size: 512 x 512
# @benchmark CUDA.@sync sum!(v_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.079 μs … 156.556 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.678 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.467 μs ±   6.550 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃█▇▇▆▆▅▄▄▃▃▂▁▁                                               
  ▅████████████████▇▇▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁ ▄
  24.1 μs         Histogram: frequency by time         53.2 μs <

 Memory estimate: 1.69 KiB, allocs estimate: 34.

# @benchmark sum!(v_cpu, M_cpu)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  35.118 μs … 153.752 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.736 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.148 μs ±   2.166 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃█▆                                                         
  ▂▅███▇▃▂▂▂▂▂▃▃▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  35.1 μs         Histogram: frequency by time         44.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

# Matrix size: 1024 x 1024
# @benchmark CUDA.@sync sum!(v_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  66.850 μs …  2.575 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     70.251 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   71.134 μs ± 25.368 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▃▅▇█▇▆▃▄▂▂▂▂▁▂▁                                      
  ▁▂▂▃▄▆▇██████████████████▇▆▅▅▄▄▄▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁ ▄
  66.8 μs         Histogram: frequency by time        78.9 μs <

 Memory estimate: 1.69 KiB, allocs estimate: 34.

# @benchmark sum!(v_cpu, M_cpu)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  165.620 μs … 514.602 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     169.756 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   170.743 μs ±   6.286 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▆█▃   ▂▅▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁▁ ▁                               ▂
  ▄▇███▄▂▃██████████████████████████▇████▇▆▆▇▇▆▆▆▆▆▄▇▆▅▆▅▆▄▅▄▆▆ █
  166 μs        Histogram: log(frequency) by time        191 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Sixzero commented 2 years ago

I am experiencing this issue too.

How come the sum function is so naive that we can do so much better without even using a reduction, just by looping over the array with each GPU thread?

maleadt commented 2 years ago

How come the sum function is so naive

If you think the current implementation is naive, feel free to take a look and improve it.

Sixzero commented 2 years ago

I am trying to find the definition of this function, but @edit just leads me to this generic part of Base:

for (fname, _fname, op) in [(:sum,     :_sum,     :add_sum), (:prod,    :_prod,    :mul_prod),
                            (:maximum, :_maximum, :max),     (:minimum, :_minimum, :min)]
    @eval begin
        # User-facing methods with keyword arguments
        @inline ($fname)(a::AbstractArray; dims=:, kw...) = ($_fname)(a, dims; kw...)
        @inline ($fname)(f, a::AbstractArray; dims=:, kw...) = ($_fname)(f, a, dims; kw...)

        # Underlying implementations using dispatch
        ($_fname)(a, ::Colon; kw...) = ($_fname)(identity, a, :; kw...)
        ($_fname)(f, a, ::Colon; kw...) = mapreduce(f, $op, a; kw...)
    end
end

How else could I find the source of the CUDA implementation?

maleadt commented 2 years ago

It's built on the generic mapreduce implementation: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/mapreduce.jl
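
Concretely, sum(A, dims=2) on a CuArray should bottom out in that file, so it behaves the same as a direct mapreduce call (a sketch):

using CUDA

A = CUDA.randn(Float32, 1000, 1000)
# both paths go through CUDA.jl's generic mapreduce machinery
sum(A, dims=2) ≈ mapreduce(identity, +, A; dims=2)  # true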