Open Cvikli opened 2 years ago
Or... the preallocation! Damn, I am comparing different operations. :(
I will be back with the right measurements.
This shows slightly better speed without preallocation:
N = 500; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 1000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 2000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 4000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 6000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 8000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 10000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
N = 12000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync sum!($B, $A))
New results:
41.969 μs (73 allocations: 2.47 KiB)
95.652 μs (152 allocations: 4.94 KiB)
246.136 μs (207 allocations: 6.66 KiB)
921.868 μs (1572 allocations: 49.31 KiB)
2.781 ms (2101 allocations: 65.84 KiB)
5.428 ms (5666 allocations: 177.25 KiB)
8.370 ms (10369 allocations: 324.22 KiB)
12.587 ms (16901 allocations: 528.34 KiB)
But there is actually still a notable difference.
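For context, the timings above can be converted into an effective memory bandwidth, since a row-sum has to read every Float32 (4 bytes) of the N×N matrix at least once. A rough CPU-side estimate using three of the measurements above:

```julia
# Effective bandwidth for an N×N Float32 reduction that takes `seconds`:
# at minimum 4*N^2 bytes must be read from device memory.
bandwidth_gbs(N, seconds) = 4 * N^2 / seconds / 1e9

for (N, t) in [(500, 41.969e-6), (4000, 921.868e-6), (12000, 12.587e-3)]
    println("N = $N: ", round(bandwidth_gbs(N, t); digits = 1), " GB/s")
end
# → ≈ 23.8, 69.4 and 45.8 GB/s respectively
```

If these numbers fall well short of the GPU's peak bandwidth, the reduction is launch- or latency-bound rather than memory-bound.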
I don't want to mix things up, but running the builtin sum(..., dims=2) and mean(..., dims=2) gives nearly exactly the same speed in every test case. Does this help you understand what the problem might be?
> running the builtin sum(..., dims=2) and mean(..., dims=2) gives nearly exactly the same speed in every test case

That's expected, right? mean is just sum divided by length.
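A quick CPU check of that identity (plain Julia, nothing CUDA-specific):

```julia
using Statistics

# mean along a dimension is exactly the sum along that dimension
# divided by the length of that dimension.
A = rand(Float32, 5, 7)
@assert mean(A; dims = 2) ≈ sum(A; dims = 2) ./ size(A, 2)
```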
Yes, definitely. I just wanted to make sure the sum itself is what matters here, as there are multiple aggregation algorithms that could depend on it.
I wonder if it's just a launch configuration issue again, because generally CUDA.jl uses the max threads returned by the occupancy API, likely 1024 threads. Could you compare against that?
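For reference, CUDA.jl exposes the occupancy-derived launch configuration via `launch_configuration`; a sketch like the following (the kernel here is just a placeholder, not from this issue) shows the thread count it would pick:

```julia
using CUDA

# Placeholder kernel, just so there is something to compile.
function fill_zero!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(x) && (@inbounds x[i] = 0.0f0)
    return nothing
end

x = CUDA.zeros(Float32, 1024)
kernel = @cuda launch=false fill_zero!(x)   # compile without launching
config = launch_configuration(kernel.fun)   # what the occupancy API suggests
@show config.threads config.blocks
```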
You mean test with 64/256/512/1024 threads and compare each test case?
Yeah. nth=1024 would compare with CUDA.jl's launch configuration.
I got the same results. The code I ran:
nth=1024
N = 500; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B,$A,$N))
N = 1000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B,$A,$N))
N = 2000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B,$A,$N))
N = 4000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B,$A,$N))
N = 6000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B,$A,$N))
N = 8000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B,$A,$N))
N = 10000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B,$A,$N))
N = 12000; A = CUDA.randn(Float32, N, N); B = CUDA.randn(Float32, N, 1); (@btime CUDA.@sync @cuda threads=nth blocks=cld(N*N, nth) mysum($B,$A,$N))
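(`mysum` is defined in the linked issue and not reproduced here. Purely for illustration, a kernel with the same launch shape as above — one thread per matrix element, `N*N` threads total — might look like the following hypothetical sketch; it is not the author's exact kernel:)

```julia
using CUDA

# Hypothetical sketch, NOT the `mysum` from the linked issue: one thread per
# element, accumulating each row's sum into B via atomic adds.
# B must be zero-initialized before the launch.
function mysum_sketch(B, A, N)
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if idx <= N * N
        row = mod1(idx, N)   # column-major: the linear index walks down rows first
        @inbounds CUDA.@atomic B[row] += A[idx]
    end
    return nothing
end

# Usage (matching the launch shape above):
#   B .= 0
#   @cuda threads=nth blocks=cld(N*N, nth) mysum_sketch(B, A, N)
```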
Hi,
my idea would be to turn the summation into a matrix-vector multiplication, where the vector consists of ones. I have implemented and benchmarked a few different combinations.
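The identity being exploited, checked on the CPU before going to the GPU (plain Julia, no CUDA needed):

```julia
# ones(1, m) * M forms a dot product of ones with each column of M,
# i.e. it produces the column sums, so summing that gives sum(M).
M = rand(Float32, 3, 4)
v = ones(Float32, 1, size(M, 1))
@assert v * M ≈ sum(M; dims = 1)
@assert sum(v * M) ≈ sum(M)
```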
Here is the code:
using CUDA
M_rand = CUDA.rand(Float32, 1000, 1001)
M_cpu = Array(M_rand)
v_ones = CUDA.ones(Float32, 1, size(M_rand)[1])
M_ones = CUDA.ones(Float32, size(M_rand))
function dot_sum(M)
    v = CUDA.ones(Float32, 1, size(M)[1])
    sum(v * M)
end

function dot_sum_full(M)
    v = CUDA.ones(Float32, 1, size(M)[1])
    reduced = v * M
    v2 = CUDA.ones(Float32, length(reduced))
    # Avoid scalar indexing warning
    reduce(+, reduced * v2)
end

function dot_sum_prealloc(v, M)
    sum(v * M)
end
using LinearAlgebra
@cuassert sum(M_rand) ≈ dot_sum(M_rand) ≈ dot_sum_full(M_rand) ≈ dot_sum_prealloc(v_ones, M_rand) ≈ dot(M_ones, M_rand)
using BenchmarkTools
@benchmark CUDA.@sync dot_sum(M_rand)
@benchmark CUDA.@sync dot_sum_full(M_rand)
@benchmark CUDA.@sync dot_sum_prealloc(v_ones, M_rand)
@benchmark CUDA.@sync sum(M_rand)
@benchmark CUDA.@sync dot(M_ones, M_rand)
@benchmark sum(M_cpu)
And the results:
# @benchmark CUDA.@sync dot_sum(M_rand)
BenchmarkTools.Trial: 9906 samples with 1 evaluation.
Range (min … max): 175.332 μs … 262.126 ms ┊ GC (min … max): 0.00% … 9.63%
Time (median): 203.573 μs ┊ GC (median): 0.00%
Time (mean ± σ): 496.464 μs ± 2.749 ms ┊ GC (mean ± σ): 0.51% ± 0.10%
█▅▄▂▁ ▁ ▂▁ ▁
█████▇▆▄▅▄▃▄▆▆▆█▇▇█▇▆▆▆▆▇▇███▇▅▄▄▄▁▁▄▄▅▄▃▃▅▃▄▆▇█▆▆████▇▆▅▃▄▆▆ █
175 μs Histogram: log(frequency) by time 3.79 ms <
Memory estimate: 5.22 KiB, allocs estimate: 112.
# @benchmark CUDA.@sync dot_sum_full(M_rand)
BenchmarkTools.Trial: 9739 samples with 1 evaluation.
Range (min … max): 178.480 μs … 8.318 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 199.840 μs ┊ GC (median): 0.00%
Time (mean ± σ): 506.466 μs ± 1.029 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▄▂▁ ▁▁ ▁
████▇▆▅▆▇█▇▇▇▆▇███▇▆▅▅▅▅▅▄▄▅▆▅▇███▇▆▆▅▅▅▅▄▅▅▅▄▅▄▆▅▆▆▆▅▄▁▅▆▅ █
178 μs Histogram: log(frequency) by time 6.1 ms <
Memory estimate: 3.62 KiB, allocs estimate: 95.
# @benchmark CUDA.@sync dot_sum_prealloc(v_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 167.418 μs … 157.410 ms ┊ GC (min … max): 0.00% … 11.92%
Time (median): 182.596 μs ┊ GC (median): 0.00%
Time (mean ± σ): 441.800 μs ± 1.753 ms ┊ GC (mean ± σ): 0.42% ± 0.12%
█▄▂ ▁▁ ▁
████▆▅▅▄▄▁▄▁▅▄▇▇▆▇▇▇▆▆▆▆▆▇▇██▇▇▅▅▄▅▅▅▆▅▄▆▇▇▇▆██▇█▇▆▄▄▄▃▁▁▃▄▃▃ █
167 μs Histogram: log(frequency) by time 4.21 ms <
Memory estimate: 5.06 KiB, allocs estimate: 106.
# @benchmark CUDA.@sync sum(M_rand)
BenchmarkTools.Trial: 3969 samples with 1 evaluation.
Range (min … max): 492.841 μs … 12.605 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 713.582 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.245 ms ± 1.029 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▇▇█▇▆▄▁ ▁▁▁▂▂▁ ▁ ▁▁ ▁▂▃▃▂▂▁ ▁
███████▇█▆▄▆████████▇▆▇██████▇▆▇▆▇███████████▇▃▃▃▅▇█▆▅▄▃▆█▅▇ █
493 μs Histogram: log(frequency) by time 4.24 ms <
Memory estimate: 4.59 KiB, allocs estimate: 82.
# @benchmark CUDA.@sync dot(M_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 248.969 μs … 4.300 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 262.834 μs ┊ GC (median): 0.00%
Time (mean ± σ): 473.993 μs ± 704.821 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▄▁▂ ▁ ▁ ▁ ▂▁ ▁
████▄█▇▃▄▁▄▃▅▃▅█▇█▇▅▅▅▅▅▅▆▅▄▆▆▁▃▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▅█▇▄▅▄▃▅▄▅▅▆██ █
249 μs Histogram: log(frequency) by time 3.24 ms <
Memory estimate: 32 bytes, allocs estimate: 2.
# @benchmark sum(M_cpu)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 104.205 μs … 1.220 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 198.371 μs ┊ GC (median): 0.00%
Time (mean ± σ): 229.274 μs ± 94.172 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂█▇▇█▇▆▄▃▃▁
▁▁▂█████████████▇▇▇▆▆▆▆▅▆▅▄▅▅▄▄▄▄▄▃▃▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁ ▄
104 μs Histogram: frequency by time 539 μs <
Memory estimate: 16 bytes, allocs estimate: 1.
So I'm getting CPU-like speeds with these implementations, which in my eyes is pretty good for a reduce operation on the GPU. Compared to the existing sum it is ~2.5× faster for the mean, but with a higher standard deviation.
Hi,
I found that the function sum! runs very slowly on the GPU and allocates more memory than expected.
A minimal working example:
using CUDA, BenchmarkTools
M_rand = CUDA.rand(Float32, 4, 4)
M_cpu = Array(M_rand)
v_ones = CUDA.ones(Float32, 1, 4)
v_cpu = Array(v_ones)
@benchmark CUDA.@sync sum!(v_ones, M_rand)
@benchmark sum!(v_cpu, M_cpu)
# @benchmark CUDA.@sync sum!(v_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.715 μs … 104.607 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 15.520 μs ┊ GC (median): 0.00%
Time (mean ± σ): 15.997 μs ± 2.174 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▇██▇▆█▅▄▄▃▃▂▁▁
▂▃▇████████████████▇▆▆▆▅▅▅▄▃▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▄
13.7 μs Histogram: frequency by time 22.7 μs <
Memory estimate: 1.66 KiB, allocs estimate: 32.
# @benchmark sum!(v_cpu, M_cpu)
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
Range (min … max): 34.386 ns … 925.473 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 34.611 ns ┊ GC (median): 0.00%
Time (mean ± σ): 35.039 ns ± 9.019 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▅ ▃▄▁ ▂
█████▆▅▅▁▁▁▁▁▆▃▁▁▄▆▇▅▅▇▅▅███▆█▄▆▄▃▄▃▁▃▅█▇▁▁▁▃▁▄▃▃▄▄▃▄▅▆▃▃▄▄▅ █
34.4 ns Histogram: log(frequency) by time 40.2 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
So, doing sum! on the GPU is ~400× slower than on the CPU, and allocates 1.66 KiB of memory for a 4×4 matrix.
Any comments on this?
I am running Julia 1.7.1 on a Linux server with CPU Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz and GPU Quadro RTX 5000.
> Any comments on this?
This isn't what a GPU is for. The launch overhead alone is around 10 μs, so this is perfectly expected. You shouldn't be using a GPU for small array operations, like the 4×4 matrix here.
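The launch overhead itself can be measured by timing an empty kernel (sketch; the exact number depends on driver, hardware, and CUDA.jl version):

```julia
using CUDA, BenchmarkTools

# An empty kernel: everything @btime reports here is pure launch + sync cost.
empty_kernel() = nothing

@btime CUDA.@sync @cuda empty_kernel()
```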
I see. I did more tests with larger array sizes, and the GPU is faster once the size goes beyond 512×512. Thanks!
# Matrix size: 128 x 128
# @benchmark CUDA.@sync sum!(v_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 14.448 μs … 153.065 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 15.960 μs ┊ GC (median): 0.00%
Time (mean ± σ): 16.347 μs ± 2.057 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄▆█▇▇▇▆▅▄▃▁▁
▁▃▄███████████████▇▆▆▅▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
14.4 μs Histogram: frequency by time 22.5 μs <
Memory estimate: 1.66 KiB, allocs estimate: 32.
# @benchmark sum!(v_cpu, M_cpu)
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
Range (min … max): 4.481 μs … 12.081 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 4.605 μs ± 498.147 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▄ ▁ ▁
█▄▃▁▁██▅█▆▄▆▆▅▄▄▅▃▃▄▄▃▃▁▃▁▁▃▄▄▄▄▄▄▄▄▁▁▁▁▁▁▄▄▄▅▄▃▃▁▃▁▁▁▁▅▄▄█ █
4.48 μs Histogram: log(frequency) by time 7.73 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
# Matrix size: 512 x 512
# @benchmark CUDA.@sync sum!(v_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 24.079 μs … 156.556 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 29.678 μs ┊ GC (median): 0.00%
Time (mean ± σ): 31.467 μs ± 6.550 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▇▇▆▆▅▄▄▃▃▂▁▁
▅████████████████▇▇▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁ ▄
24.1 μs Histogram: frequency by time 53.2 μs <
Memory estimate: 1.69 KiB, allocs estimate: 34.
# @benchmark sum!(v_cpu, M_cpu)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 35.118 μs … 153.752 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 35.736 μs ┊ GC (median): 0.00%
Time (mean ± σ): 36.148 μs ± 2.166 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▆
▂▅███▇▃▂▂▂▂▂▃▃▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▁▂▂ ▃
35.1 μs Histogram: frequency by time 44.4 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
# Matrix size: 1024 x 1024
# @benchmark CUDA.@sync sum!(v_ones, M_rand)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 66.850 μs … 2.575 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 70.251 μs ┊ GC (median): 0.00%
Time (mean ± σ): 71.134 μs ± 25.368 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▅▇█▇▆▃▄▂▂▂▂▁▂▁
▁▂▂▃▄▆▇██████████████████▇▆▅▅▄▄▄▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁ ▄
66.8 μs Histogram: frequency by time 78.9 μs <
Memory estimate: 1.69 KiB, allocs estimate: 34.
# @benchmark sum!(v_cpu, M_cpu)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 165.620 μs … 514.602 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 169.756 μs ┊ GC (median): 0.00%
Time (mean ± σ): 170.743 μs ± 6.286 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▃ ▂▅▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁▁ ▁ ▂
▄▇███▄▂▃██████████████████████████▇████▇▆▆▇▇▆▆▆▆▆▄▇▆▅▆▅▆▄▅▄▆▆ █
166 μs Histogram: log(frequency) by time 191 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
I am experiencing this issue too.
How come the sum function is so naive that we can do so much better without even using a reduce, just by looping over the array with each GPU thread?
> How come the sum function is so naive

If you think the current implementation is naive, feel free to take a look and improve it.
I am trying to find the definition of this function, but @edit leads me to this part of Base:
for (fname, _fname, op) in [(:sum, :_sum, :add_sum), (:prod, :_prod, :mul_prod),
                            (:maximum, :_maximum, :max), (:minimum, :_minimum, :min)]
    @eval begin
        # User-facing methods with keyword arguments
        @inline ($fname)(a::AbstractArray; dims=:, kw...) = ($_fname)(a, dims; kw...)
        @inline ($fname)(f, a::AbstractArray; dims=:, kw...) = ($_fname)(f, a, dims; kw...)

        # Underlying implementations using dispatch
        ($_fname)(a, ::Colon; kw...) = ($_fname)(identity, a, :; kw...)
        ($_fname)(f, a, ::Colon; kw...) = mapreduce(f, $op, a; kw...)
    end
end
How else could I find the source of the CUDA implementation?
It's built on the generic mapreduce implementation: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/mapreduce.jl
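One way to trace the dispatch chain from the REPL (a sketch; requires a working GPU, and `@which` on the inner call should point into GPUArrays rather than Base):

```julia
using CUDA, InteractiveUtils

A = CUDA.rand(Float32, 8, 8)
B = CUDA.zeros(Float32, 1, 8)

# sum! itself is generic Base code; the GPU-specific work happens further
# down in Base.mapreducedim!, which GPUArrays overloads for GPU arrays.
@which sum!(B, A)
@which Base.mapreducedim!(identity, Base.add_sum, B, A)
```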
Coming from this observation in issue #1323.
Running this code:
gives these results:
Possibly: 200-1000% speedup.
Why are we faster with "mysum"? Is it the CUDA.const? Or the specialization on dims=2? Or is there something we're not seeing?