CliMA / ClimaAtmos.jl

ClimaAtmos.jl is a library for building atmospheric circulation models that is designed from the outset to leverage data assimilation and machine learning tools. We welcome contributions!

Investigate causes of poor scaling in multi-GPU runs #2222

Open · simonbyrne opened this issue 1 year ago

simonbyrne commented 1 year ago

The poor scaling seems to be primarily driven by GC. We need to look at memory allocations, and at a mechanism to synchronize the garbage collector across ranks.

cf. earlier discussion here: https://github.com/CliMA/ClimaAtmos.jl/issues/686
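As a point of reference, here is a minimal sketch of one possible synchronization mechanism (not what ClimaAtmos currently does): disable automatic GC and trigger collections at the same point on every rank, so no rank pauses while its neighbours are waiting on it. The `do_timestep!`, `nsteps`, and `gc_every` names are placeholders.

```julia
# Sketch: synchronized, manual GC across MPI ranks.
# `do_timestep!`, `nsteps`, and `gc_every` are illustrative placeholders, not ClimaAtmos API.
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

do_timestep!() = nothing   # stand-in for the real model timestep

GC.enable(false)           # keep the GC from firing at arbitrary, rank-dependent times
                           # (assumes enough memory headroom between collections)
nsteps, gc_every = 1_000, 100
for step in 1:nsteps
    do_timestep!()
    if step % gc_every == 0
        MPI.Barrier(comm)  # make all ranks pause for GC at the same time
        GC.enable(true)    # GC.gc() is a no-op while collection is disabled
        GC.gc(false)       # incremental collection keeps individual pauses short
        GC.enable(false)
    end
end
GC.enable(true)
MPI.Finalize()
```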

simonbyrne commented 1 year ago

Looks like there were a couple of issues:

  1. I didn't request an extra CPU core for the profiler
  2. I didn't request enough memory, so GC was getting triggered more often.

Fixing those, and specifying a higher GC frequency, fixes the GC pauses: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/111#018b3c1f-a277-4683-b446-13423f7ea108
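For reference, how often GC fires and how long the pauses are can be confirmed with standard Julia tooling (nothing pipeline-specific here; the work between the two measurements is a stand-in):

```julia
# Sketch: check how often GC fires and how much time it costs around a block of work.
GC.enable_logging(true)               # Julia >= 1.8: log every collection to stderr

before = Base.gc_num()
foreach(_ -> rand(10^7), 1:10)        # stand-in for a few model timesteps
d = Base.GC_Diff(Base.gc_num(), before)
println("GC: ", d.pause, " pauses, ", d.total_time / 1e9, " s, ",
        d.allocd / 1e6, " MB allocated")
```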

However, we're still getting stuck in epoll_wait before the MPI communication starts:

[screenshot: profiler timeline showing the epoll_wait stall (2023-10-17)]

This appears to be https://github.com/JuliaGPU/CUDA.jl/issues/1910. It is fixed in CUDA.jl 5 (https://github.com/JuliaGPU/CUDA.jl/pull/2025), but unfortunately we can't upgrade yet (https://github.com/CliMA/ClimaCore.jl/issues/1500).

simonbyrne commented 10 months ago

Update: we are still seeing some cases where CPU cores are idle, which causes 3–6 ms delays.

Current plan (discussed with @sriharshakandala and @bloops)

simonbyrne commented 10 months ago

Notes on thread pinning:

simonbyrne commented 10 months ago

I tried using JULIA_EXCLUSIVE=1, but it gave worse results. My suspicion is that this is due to hyperthreads; I will need to investigate further.
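For context, JULIA_EXCLUSIVE=1 pins the Julia threads compactly to the first cores of the allocation, which can land two Julia threads on sibling hyperthreads. A hedged sketch of how the mapping could be inspected and controlled more explicitly with the ThreadPinning.jl package (not something this thread has adopted, just an option to try):

```julia
# Sketch: inspect and control thread affinity with ThreadPinning.jl.
using ThreadPinning

threadinfo()        # show how Julia threads currently map to cores / hyperthreads
pinthreads(:cores)  # pin one Julia thread per physical core, avoiding hyperthread siblings
getcpuids()         # confirm which CPU IDs the Julia threads ended up on
```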

charleskawczynski commented 9 months ago

Is there a reproducer for this? Or a job ID that we can add for reference?

charleskawczynski commented 9 months ago

We use blocking by default now: https://github.com/search?q=repo%3ACliMA%2FClimaCore.jl%20blocking&type=code.

charleskawczynski commented 9 months ago

The reproducer is on the GPU target pipeline.

simonbyrne commented 9 months ago

Okay, so here is what I've learnt:

On the Slurm side

On the Julia side

simonbyrne commented 9 months ago

One other opportunity for improvement:

Our current DSS operation looks something like this:

  1. launch fill send buffer kernels
  2. CUDA.synchronize()
  3. MPI.Startall(...)
  4. launch internal dss kernels
  5. MPI.Waitall(...)
  6. launch exterior kernels

The problem is that the GPU is completely idle during step 3, and during the launch latency of step 4:

[screenshot: profiler timeline showing the GPU idle during MPI communication and the subsequent kernel launch (2024-01-26)]

Instead of synchronizing the whole stream, we could synchronize via events:

  1. launch fill send buffer kernels
  2. CUDA.record(send_event)
  3. launch internal dss kernels
  4. CUDA.synchronize(send_event)
  5. MPI.Startall(...)
  6. MPI.Waitall(...)
  7. launch exterior kernels

In this way, the internal dss kernels can run during MPI communication.
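A minimal sketch of the event-based variant with CUDA.jl and MPI.jl; the kernel-launch closures and the persistent-request vector are placeholders for the real DSS code, not ClimaCore's actual API:

```julia
# Sketch: overlap interior DSS work with MPI communication using a CUDA event.
using CUDA, MPI

function dss_exchange!(fill_send_buffers!, interior_dss!, exterior_dss!,
                       reqs::Vector{MPI.Request})
    send_event = CuEvent(CUDA.EVENT_DISABLE_TIMING)

    fill_send_buffers!()          # 1. launch fill-send-buffer kernels
    CUDA.record(send_event)       # 2. mark the point where the send buffers are ready
    interior_dss!()               # 3. launch internal dss kernels (queued behind the fill)
    CUDA.synchronize(send_event)  # 4. CPU waits only for the fill kernels, not the whole stream
    MPI.Startall(reqs)            # 5. start the persistent sends/receives
    MPI.Waitall(reqs)             # 6. block until communication completes
    exterior_dss!()               # 7. launch exterior kernels
    return nothing
end
```

The CPU still blocks at step 4, but only until the send buffers are filled, so the internal dss kernels queued at step 3 keep the GPU busy through steps 5 and 6.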

simonbyrne commented 9 months ago

Oh, and it also appears that thread pinning on clima is a net negative. It causes occasional ~10–20 ms stalls when the OS thread scheduler kicks in:

[screenshot: profiler timeline showing ~10–20 ms OS-scheduler stalls with thread pinning (2024-01-25)]

On the other hand, as long as we use Slurm thread binding (but not process-level thread pinning) with a sufficient number of threads (in this case, 4 hardware threads assigned to 3 Julia threads), we do see occasional very short (~20 µs) pauses, but the thread then switches to a new hardware thread, so there is very little net effect (notice the change in color):

[screenshot: profiler timeline showing a short pause followed by migration to a new hardware thread (2024-01-26)]

I've updated our GPU pipeline in https://github.com/CliMA/ClimaAtmos.jl/pull/2585

charleskawczynski commented 8 months ago

https://github.com/CliMA/ClimaTimeSteppers.jl/pull/260 should help with scaling by reducing the number of DSS calls (we'll be eliminating 4 per timestep).

charleskawczynski commented 8 months ago

Upgrading CUDA and Adapt, together with https://github.com/JuliaGPU/Adapt.jl/pull/78, will reduce allocations for GPU runs by a factor of ~15, which may help reduce GC pressure. We should be able to reduce the frequency of GC calls after this update.
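Once the upgrade lands, the per-step allocation reduction can be sanity-checked with standard CUDA.jl tooling; `step!` below is a stand-in for whatever drives one model timestep:

```julia
# Sketch: report CPU and GPU allocations for one timestep with CUDA.@time.
using CUDA

step!() = CUDA.@sync sum(CUDA.rand(Float32, 1024, 1024))  # stand-in for a real model step

step!()              # warm up so compilation is not included in the measurement
CUDA.@time step!()   # prints time plus CPU and GPU allocation counts/sizes
```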