Closed navidcy closed 7 months ago
Is it possible to perform long-running simulations on the GPU when there are allocations? Can GPU garbage collection keep up?
I'm not sure about that. But I'm also a bit confused about where the allocations are coming from.
random_uniform(T, size(forcing_spectrum)...))
Calling CUDA.randn(sz...)
allocates an array, and then populates it with random numbers.
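To make the distinction concrete, here is a minimal CPU-only sketch of the two styles, using the standard library `Random` rather than CUDA so it runs without a GPU. The names `fresh`, `refill!`, `buf` and the size `(64, 64)` are hypothetical and chosen only for illustration; the GPU case discussed in this thread may behave differently.

```julia
using Random

# Hypothetical size and buffer, for illustration only.
const sz  = (64, 64)
const buf = zeros(Float64, sz...)    # allocated once, up front

# Allocating style: every call returns a brand-new array.
fresh(dims) = randn(Float64, dims...)

# In-place style: every call overwrites the existing buffer.
refill!(A) = randn!(A)

# Warm up so that compilation is not counted in the measurements.
fresh(sz); refill!(buf)

bytes_fresh  = @allocated fresh(sz)
bytes_refill = @allocated refill!(buf)

println("allocating version: ", bytes_fresh, " bytes")
println("in-place version:   ", bytes_refill, " bytes")
```

The allocating version has to request at least `prod(sz) * sizeof(Float64)` bytes on every call, while the in-place version reuses `buf`.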
Oh yeah... that was the "allocating" version I suggested in the issue. The PR doesn't have that version; I just put it here for comparison.
But even when using randn! there are still some allocations, and I don't understand where those come from.
julia> @btime CUDA.@sync calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  19.923 μs (32 allocations: 2.02 KiB)
Did you look into the code for randn!? You'd probably find your answer quickly.
I actually didn't :(
omg, I figured it out! randn! calls inplace_pow2 which, if it is not given an array whose length is a power of 2, creates a new array whose length is the next power of 2. Thus, it allocates!!
If the array's length is a power of 2, then there are no allocations:
julia> using CUDA, Random, BenchmarkTools
julia> A = CUDA.zeros(1024, 1024);
julia> @btime Random.randn!($A);
2.417 μs (0 allocations: 0 bytes)
julia> A = CUDA.zeros(1024, 1025);
julia> @btime Random.randn!($A);
14.119 μs (10 allocations: 352 bytes)
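The two shapes above can be checked ahead of time with plain Base functions. The sketch below assumes the behaviour described in this thread (padding to the next power of 2 when the length is not one already). `pow2_report` is a hypothetical helper, not part of CUDA.jl, and it does not call any CUDA.jl internals.

```julia
# Hypothetical helper, not part of CUDA.jl: reports the total length of an
# array shape, whether that length is a power of 2, and the next power of 2.
function pow2_report(dims::Integer...)
    len = prod(dims)
    return (len = len, is_pow2 = ispow2(len), padded = nextpow(2, len))
end

a = pow2_report(1024, 1024)   # 1024 * 1024 = 2^20, already a power of 2
b = pow2_report(1024, 1025)   # 1049600 elements, not a power of 2

println(a)
println(b)
```

For the 1024 × 1025 case the next power of 2 is 2^21 = 2097152, roughly twice the requested length, which is consistent with a temporary array being needed.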
You were right. In my head this was like an impossible task but it actually took me less than 10 minutes.
Nice work 🕵️‍♂️
This forcing implementation ensures non-allocating calcF! methods for both CPU and GPU.
Closes #350
A few benchmarks:
Thus, this PR is 1.5-2x faster than the solution originally proposed in #350, with fewer allocations.