CliMA / ClimaAtmos.jl

ClimaAtmos.jl is a library for building atmospheric circulation models that is designed from the outset to leverage data assimilation and machine learning tools. We welcome contributions!

Investigate causes of poor scaling in multi-GPU runs #2222

Open · simonbyrne opened this issue 1 year ago

simonbyrne commented 1 year ago

The poor scaling seems to be primarily driven by GC. We need to look at memory allocations, and at a mechanism to synchronize the garbage collector across ranks.

cf. earlier discussion here: https://github.com/CliMA/ClimaAtmos.jl/issues/686
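As a point of reference, here is a minimal sketch of one possible synchronization mechanism (not what ClimaAtmos currently does): disable automatic GC and trigger collections at the same point on every rank, so no rank pauses while its neighbours are waiting on it. The `do_timestep!`, `nsteps`, and `gc_every` names are placeholders.

```julia
# Sketch: synchronized, manual GC across MPI ranks.
# `do_timestep!`, `nsteps`, and `gc_every` are illustrative placeholders, not ClimaAtmos API.
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

do_timestep!() = nothing   # stand-in for the real model timestep

GC.enable(false)           # keep the GC from firing at arbitrary, rank-dependent times
                           # (assumes enough memory headroom between collections)
nsteps, gc_every = 1_000, 100
for step in 1:nsteps
    do_timestep!()
    if step % gc_every == 0
        MPI.Barrier(comm)  # make all ranks pause for GC at the same time
        GC.enable(true)    # GC.gc() is a no-op while collection is disabled
        GC.gc(false)       # incremental collection keeps individual pauses short
        GC.enable(false)
    end
end
GC.enable(true)
MPI.Finalize()
```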

simonbyrne commented 1 year ago

Looks like there were a couple of issues:

  1. I didn't request an extra CPU core for the profiler
  2. I didn't request enough memory, so GC was getting triggered more often.

Fixing those, and specifying a higher GC frequency, fixes the GC pauses: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/111#018b3c1f-a277-4683-b446-13423f7ea108
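For reference, how often GC fires and how long the pauses are can be confirmed with standard Julia tooling (nothing pipeline-specific here; the work between the two measurements is a stand-in):

```julia
# Sketch: check how often GC fires and how much time it costs around a block of work.
GC.enable_logging(true)               # Julia >= 1.8: log every collection to stderr

before = Base.gc_num()
foreach(_ -> rand(10^7), 1:10)        # stand-in for a few model timesteps
d = Base.GC_Diff(Base.gc_num(), before)
println("GC: ", d.pause, " pauses, ", d.total_time / 1e9, " s, ",
        d.allocd / 1e6, " MB allocated")
```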

However, we're still getting stuck in epoll_wait before the MPI communication starts:

[screenshot: profiler timeline showing the epoll_wait stall (2023-10-17)]

This appears to be https://github.com/JuliaGPU/CUDA.jl/issues/1910. It is fixed in CUDA.jl 5 (https://github.com/JuliaGPU/CUDA.jl/pull/2025), but unfortunately we can't upgrade yet (https://github.com/CliMA/ClimaCore.jl/issues/1500).

simonbyrne commented 10 months ago

Update: we are still seeing some cases where CPU cores are idle, which causes 3–6 ms delays.

Current plan (discussed with @sriharshakandala and @bloops)

simonbyrne commented 10 months ago

Notes on thread pinning:

simonbyrne commented 10 months ago

I tried using JULIA_EXCLUSIVE=1, but it gave worse results. My suspicion is that this is due to hyperthreads; I will need to investigate further.
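For context, JULIA_EXCLUSIVE=1 pins the Julia threads compactly to the first cores of the allocation, which can land two Julia threads on sibling hyperthreads. A hedged sketch of how the mapping could be inspected and controlled more explicitly with the ThreadPinning.jl package (not something this thread has adopted, just an option to try):

```julia
# Sketch: inspect and control thread affinity with ThreadPinning.jl.
using ThreadPinning

threadinfo()        # show how Julia threads currently map to cores / hyperthreads
pinthreads(:cores)  # pin one Julia thread per physical core, avoiding hyperthread siblings
getcpuids()         # confirm which CPU IDs the Julia threads ended up on
```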

charleskawczynski commented 9 months ago

Is there a reproducer for this? Or a job ID that we can add for reference?

charleskawczynski commented 9 months ago

We use blocking by default now: https://github.com/search?q=repo%3ACliMA%2FClimaCore.jl%20blocking&type=code.

charleskawczynski commented 9 months ago

The reproducer is on the GPU target pipeline.

simonbyrne commented 9 months ago

Okay, so here is what I've learnt:

On the Slurm side

On the Julia side

simonbyrne commented 9 months ago

One other opportunity for improvement:

Our current DSS operation looks something like this:

  1. launch fill send buffer kernels
  2. CUDA.synchronize()
  3. MPI.Startall(...)
  4. launch internal dss kernels
  5. MPI.Waitall(...)
  6. launch exterior kernels

The problem is that the GPU is completely idle during step 3, and during the launch latency of step 4:

[screenshot: profiler timeline showing the GPU idle during MPI communication and the subsequent kernel launch (2024-01-26)]

Instead of synchronizing the whole stream, we could synchronize via events:

  1. launch fill send buffer kernels
  2. CUDA.record(send_event)
  3. launch internal dss kernels
  4. CUDA.synchronize(send_event)
  5. MPI.Startall(...)
  6. MPI.Waitall(...)
  7. launch exterior kernels

In this way, the internal dss kernels can run during MPI communication.
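A minimal sketch of the event-based variant with CUDA.jl and MPI.jl; the kernel-launch closures and the persistent-request vector are placeholders for the real DSS code, not ClimaCore's actual API:

```julia
# Sketch: overlap interior DSS work with MPI communication using a CUDA event.
using CUDA, MPI

function dss_exchange!(fill_send_buffers!, interior_dss!, exterior_dss!,
                       reqs::Vector{MPI.Request})
    send_event = CuEvent(CUDA.EVENT_DISABLE_TIMING)

    fill_send_buffers!()          # 1. launch fill-send-buffer kernels
    CUDA.record(send_event)       # 2. mark the point where the send buffers are ready
    interior_dss!()               # 3. launch internal dss kernels (queued behind the fill)
    CUDA.synchronize(send_event)  # 4. CPU waits only for the fill kernels, not the whole stream
    MPI.Startall(reqs)            # 5. start the persistent sends/receives
    MPI.Waitall(reqs)             # 6. block until communication completes
    exterior_dss!()               # 7. launch exterior kernels
    return nothing
end
```

The CPU still blocks at step 4, but only until the send buffers are filled, so the internal dss kernels queued at step 3 keep the GPU busy through steps 5 and 6.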

simonbyrne commented 9 months ago

Oh, and it also appears that thread pinning on clima is a net negative. It causes occasional ~10–20 ms stalls when the OS thread scheduler kicks in:

[screenshot: profiler timeline showing ~10–20 ms OS-scheduler stalls with thread pinning (2024-01-25)]

On the other hand, as long as we use Slurm thread binding (but not process-level thread pinning) with a sufficient number of threads (in this case, 4 hardware threads assigned to 3 Julia threads), we do see occasional very short (~20 µs) pauses, but the thread then switches to a new hardware thread, so there is very little net effect (notice the change in color):

[screenshot: profiler timeline showing a short pause followed by migration to a new hardware thread (2024-01-26)]

I've updated our GPU pipeline in https://github.com/CliMA/ClimaAtmos.jl/pull/2585

charleskawczynski commented 8 months ago

https://github.com/CliMA/ClimaTimeSteppers.jl/pull/260 should help with scaling by reducing the number of DSS calls (we'll be eliminating 4 per timestep).

charleskawczynski commented 8 months ago

Upgrading CUDA and Adapt, together with https://github.com/JuliaGPU/Adapt.jl/pull/78, will reduce allocations for GPU runs by a factor of ~15, which may help reduce GC pressure. We should be able to reduce the frequency of GC calls after this update.
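Once the upgrade lands, the per-step allocation reduction can be sanity-checked with standard CUDA.jl tooling; `step!` below is a stand-in for whatever drives one model timestep:

```julia
# Sketch: report CPU and GPU allocations for one timestep with CUDA.@time.
using CUDA

step!() = CUDA.@sync sum(CUDA.rand(Float32, 1024, 1024))  # stand-in for a real model step

step!()              # warm up so compilation is not included in the measurement
CUDA.@time step!()   # prints time plus CPU and GPU allocation counts/sizes
```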