simonbyrne opened this issue 1 year ago
Looks like there were a couple of issues:
Fixing those, and specifying a higher GC frequency, fixes the GC pauses: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/111#018b3c1f-a277-4683-b446-13423f7ea108
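For context, a minimal sketch of what forcing a higher GC frequency can look like (the function name and the every-N-steps cadence are illustrative, not the actual ClimaAtmos callback):

```julia
# Illustrative sketch: run the garbage collector on a fixed cadence instead of waiting
# for the allocation heuristic to trigger one long, badly timed pause.
# `n_steps_per_gc` is a tuning knob, not a value taken from the pipeline above.
function make_gc_callback(; n_steps_per_gc = 100)
    step = Ref(0)
    return function ()
        step[] += 1
        if step[] % n_steps_per_gc == 0
            GC.gc(false)   # incremental collection; GC.gc() would force a full sweep
        end
        return nothing
    end
end
```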
However, we're still getting stuck in `epoll_wait` before the MPI communication starts:
This appears to be https://github.com/JuliaGPU/CUDA.jl/issues/1910, which is fixed in CUDA.jl 5 (https://github.com/JuliaGPU/CUDA.jl/pull/2025), but unfortunately we can't upgrade yet (https://github.com/CliMA/ClimaCore.jl/issues/1500).
Update: we are still seeing some cases where CPU cores are idle, which causes 3–6 ms delays.
Current plan (discussed with @sriharshakandala @bloops)
I tried using `JULIA_EXCLUSIVE=1`, but it gave worse results. My suspicion is that it could be due to hyperthreads; I will need to investigate further.
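One way to check (a suggestion on my part, assuming ThreadPinning.jl is available in the environment) is to print where the Julia threads actually land, so two Julia threads sharing the hyperthreads of one core would show up directly:

```julia
using ThreadPinning

# Visualize how the Julia threads map onto hardware threads/cores.
# With JULIA_EXCLUSIVE=1, two Julia threads ending up on the two hyperthreads
# of the same physical core would be consistent with the worse results above.
threadinfo()
```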
Is there a reproducer for this? Or a job ID that we can add for reference?
We use blocking by default now: https://github.com/search?q=repo%3ACliMA%2FClimaCore.jl%20blocking&type=code.
The reproducer is the GPU target pipeline.
Okay, so here is what I've learnt:
On the Slurm side, specify:

- `--cpus-per-task=n` in both the `sbatch` and `srun` options (the `sbatch` ones used to be automatically forwarded, but not any more; see https://groups.google.com/g/slurm-users/c/JQRgrKaKCcw/m/hpZtXOfwEQAJ)
- `--cpu-bind=threads` in `srun`

On the Julia side, when running under `nsys profile`, it can be helpful to leave a core for that as well.

One other opportunity for improvement:
Our current DSS operation looks something like:

```julia
CUDA.synchronize()
MPI.Startall(...)
MPI.Waitall(...)
```

The problem is that the GPU is completely idle while waiting on the MPI communication, and during the launch latency of the kernels launched afterwards:
Instead of synchronizing the whole stream, we could synchronize via events:

```julia
CUDA.record(send_event)
CUDA.synchronize(send_event)
MPI.Startall(...)
MPI.Waitall(...)
```

In this way, the internal DSS kernels can run during the MPI communication.
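To make the intent concrete, here is a sketch of how the pieces could fit together. The two helper functions are placeholders of mine, not ClimaCore APIs, and `reqs` is assumed to be a vector of persistent MPI requests set up elsewhere:

```julia
using CUDA, MPI

# Placeholders standing in for the real DSS kernels (not actual ClimaCore functions).
fill_send_buffers!() = nothing   # would launch kernels that populate the MPI send buffers
dss_internal!() = nothing        # would launch the interior (non-halo) DSS kernels

function overlapped_dss!(reqs)
    send_event = CuEvent(CUDA.EVENT_DISABLE_TIMING)

    fill_send_buffers!()
    CUDA.record(send_event)       # completes once the send-buffer kernels have finished

    dss_internal!()               # launched after the record, so it does not delay the event

    CUDA.synchronize(send_event)  # wait only for the send buffers, not the whole stream
    MPI.Startall(reqs)
    MPI.Waitall(reqs)             # interior kernels keep running on the GPU meanwhile
    return nothing
end
```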
Oh, and it also appears that thread pinning on `clima` is a net negative: it causes occasional ~10–20 ms pauses when the OS thread scheduler kicks in:
On the other hand, as long as we use Slurm thread binding (but not process thread pinning) with a sufficient number of threads (in this case, 4 hardware threads assigned to 3 Julia threads), we do see occasional very short (~20 µs) pauses, but the thread then switches to a new hardware thread, with very little net effect (notice the change in color):
I've updated our GPU pipeline in https://github.com/CliMA/ClimaAtmos.jl/pull/2585
https://github.com/CliMA/ClimaTimeSteppers.jl/pull/260 should help with scaling by reducing the number of DSS calls (we'll be eliminating 4 per timestep).
Upgrading CUDA and Adapt, plus https://github.com/JuliaGPU/Adapt.jl/pull/78, will reduce allocations for GPU runs by a factor of ~15, which may help reduce GC pressure. We should be able to reduce the frequency of GC calls after this update.
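A quick way to sanity-check that improvement (a sketch; `do_one_step!` is a placeholder for whatever advances the model by one timestep, not an actual API here):

```julia
# Sketch: compare host-side allocations per step before and after the CUDA/Adapt upgrade.
# `do_one_step!` is a placeholder closure, e.g. () -> step!(integrator) in a SciML-style setup.
function allocations_per_step(do_one_step!)
    do_one_step!()                   # warm-up call so compilation is not counted
    return @allocated do_one_step!()
end
```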
It seems to be driven primarily by GC. We need to look at memory allocations, and at a mechanism to synchronize the garbage collector.
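One possible mechanism (my assumption about what "synchronize" would mean here, not a decided design): since a rank that pauses for GC stalls every other rank at the next halo exchange, all ranks could be made to collect at the same point in the timestep, e.g. behind a barrier:

```julia
using MPI

# Sketch: collect at a fixed step interval on every rank, with a barrier so the pauses
# line up instead of cascading through the MPI communication.
# `gc_interval` is an illustrative tuning knob.
function synchronized_gc!(step, comm; gc_interval = 100)
    if step % gc_interval == 0
        MPI.Barrier(comm)   # line the ranks up first
        GC.gc(false)        # incremental collection; GC.gc() for a full sweep
    end
    return nothing
end
```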
cf. the earlier discussion here: https://github.com/CliMA/ClimaAtmos.jl/issues/686