FourierFlows / GeophysicalFlows.jl

Geophysical fluid dynamics pseudospectral solvers with Julia and FourierFlows.jl.
https://fourierflows.github.io/GeophysicalFlowsDocumentation/stable/
MIT License
155 stars 32 forks source link

Use `randn!` for stochastic forcing implementations #351

Closed navidcy closed 7 months ago

navidcy commented 7 months ago

This forcing implementation ensures non-allocating calcF! methods both for CPU and GPU.

Closes #350

Few benchmarks:

# CPU with Fh .= sqrt.(forcing_spectrum) .* cis.(2π * random_uniform(T, size(forcing_spectrum)...)) ./ sqrt(clock.dt)
julia> @btime calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  239.890 μs (10 allocations: 130.23 KiB)

# GPU with Fh .= sqrt.(forcing_spectrum) .* cis.(2π * random_uniform(T, size(forcing_spectrum)...)) ./ sqrt(clock.dt)
julia> @btime CUDA.@sync calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  26.709 μs (49 allocations: 2.59 KiB)

# randn! on CPU
julia> @btime calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  122.298 μs (4 allocations: 96 bytes)

# randn! on GPU
julia> @btime CUDA.@sync calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  19.923 μs (32 allocations: 2.02 KiB)

Thus, this PR is 1.5-2x faster than the solution originally proposed in #350 and with less allocations.

navidcy commented 7 months ago

Is it possible to perform long-running simulations on the GPU when there are allocations? Can GPU garbage collection keep up?

I'm not sure about that. But I'm also bit confused regarding where are the allocations coming from.

glwagner commented 7 months ago
random_uniform(T, size(forcing_spectrum)...))

Calling CUDA.randn(sz...) allocates an array, and then populates it with random numbers.

navidcy commented 7 months ago
random_uniform(T, size(forcing_spectrum)...))

Calling CUDA.randn(sz...) allocates an array, and then populates it with random numbers.

Oh yeah... that was the "allocating" version I suggested in the issue. The PR doesn't have that version, I just put it here for comparison.

But still using randn! you see there are some allocations... Those I don't understand where they come from.

julia> @btime CUDA.@sync calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  19.923 μs (32 allocations: 2.02 KiB)
glwagner commented 7 months ago
random_uniform(T, size(forcing_spectrum)...))

Calling CUDA.randn(sz...) allocates an array, and then populates it with random numbers.

Oh yeah... that was the "allocating" version I suggested in the issue. The PR doesn't have that version, I just put it here for comparison.

But still using randn! you see there are some allocations... Those I don't understand where they come from.

julia> @btime CUDA.@sync calcF!($vars.Fh, $sol, 0.0, $clock, $vars, $params, $grid)
  19.923 μs (32 allocations: 2.02 KiB)

Did you look into the code for randn!? You'd probably find your answer quickly.

navidcy commented 7 months ago

I actually didn't :(

navidcy commented 7 months ago

omg, I figured it out!

navidcy commented 7 months ago

randn! calls inplace_pow2 which, if is not provided with an array of length that is a power of 2, then it creates a new array that is of size the next power of 2 --- thus, it allocates!!

https://github.com/JuliaGPU/CUDA.jl/blob/bb49887198f258ffcb186d81df4a787453428b38/lib/curand/random.jl#L83-L111

If we have arrays that have length that is a power of 2 then there is no allocations:

julia> using CUDA, Random

julia> A = CUDA.zeros(1024, 1024);

julia> @btime Random.randn!($A);
  2.417 μs (0 allocations: 0 bytes)

julia> A = CUDA.zeros(1024, 1025);

julia> @btime Random.randn!($A);
  14.119 μs (10 allocations: 352 bytes)
navidcy commented 7 months ago

Did you look into the code for randn!? You'd probably find your answer quickly.

You were right. In my head this was like an impossible task but it actually took me less than 10 minutes.

glwagner commented 7 months ago

Nice work 🕵️‍♂️