CliMA / ClimaCore.jl

CliMA model dycore
https://clima.github.io/ClimaCore.jl/dev

ClimaCore 0.14.13 slows down some GPU simulations and degrades scaling #1993

Open · Sbozzolo opened this issue 2 months ago

Sbozzolo commented 2 months ago

Coupled diagnostic EDMF with 16 horizontal elements (config: config/longrun_configs/amip_target_diagedmf.yml):

- ClimaCore 0.14.13 with ClimaAtmos 0.27.5: SYPD ~0.86
- ClimaCore 0.14.12 with ClimaAtmos 0.27.5: SYPD ~0.96
- ClimaCore 0.14.13 with ClimaAtmos 0.27.4: SYPD ~0.91
- ClimaCore 0.14.13 with ClimaAtmos 0.27.5, but no aerosol: SYPD ~0.92

Atmos-only EDMF on 4 GPUs with 30 horizontal elements (config: config/benchmark_configs/climaatmos_diagedmf.yml in the coupler):

- ClimaCore 0.14.12 with ClimaAtmos 0.27.4: SYPD ~1.06
- ClimaCore 0.14.13 with ClimaAtmos 0.27.5: SYPD ~0.88

Scaling also degrades with ClimaCore 0.14.13: the job has two builds, one with 1 GPU and one with 4 GPUs, and the SYPD with 4 GPUs is only ~25% higher than with 1 GPU, where it used to be more like 50%.
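
For reference, a minimal sketch of the scaling arithmetic behind that last point (the SYPD values below are placeholders, not numbers from a specific build):

```julia
# Scaling check: SYPD on 4 GPUs relative to 1 GPU; ideal strong scaling would be 4x.
sypd_1gpu = 0.80                      # placeholder 1-GPU value
sypd_4gpu = 1.00                      # placeholder 4-GPU value ("only 25% more")
speedup    = sypd_4gpu / sypd_1gpu    # 1.25
efficiency = speedup / 4              # ≈ 0.31 of ideal 4-GPU scaling
println("speedup = $speedup, efficiency = $(round(100 * efficiency; digits = 1))%")
```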

charleskawczynski commented 2 months ago

I think we need to narrow the scope of this issue. First, can we remove the items whose simulations are broken/canceled? Relying on progress info during a simulation is not precise.

charleskawczynski commented 2 months ago

Doing that leaves us with:

- ClimaCore 0.14.12 with ClimaAtmos 0.27.4: SYPD ~1.06
- ClimaCore 0.14.13 with ClimaAtmos 0.27.5: SYPD ~0.88

This updates two things at once, so I'm not sure we have measurements that allow a fair comparison.

charleskawczynski commented 2 months ago

The builds I ran before merging each change were:

It is peculiar, though: the results do show some variation per function call. In particular, the build where the DSS was refactored was significantly better, but less so after it was merged.

Sbozzolo commented 2 months ago

> I think we need to narrow the scope of this issue. First, can we remove the items whose simulations are broken/canceled? Relying on progress info during a simulation is not precise.

There's a 28-minute difference in walltime to reach simulation_time = "13 weeks, 34 minutes" between the coupled runs with ClimaCore 0.14.12 and ClimaCore 0.14.13. Even if the simulation was later canceled or failed, that's still evidence of a slowdown. (That's a ~7% difference, which is consistent with what we see in SYPD.)
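
A rough sanity check of that ~7% figure (the SYPD of 0.9 is an assumed round number, not a measurement):

```julia
# Reaching 13 simulated weeks at SYPD ≈ 0.9 takes roughly 6.7 hours of wall time,
# so a 28-minute gap corresponds to roughly a 7% slowdown.
walltime_hours = (13 / 52) / 0.9 * 24          # ≈ 6.7
slowdown       = 28 / (walltime_hours * 60)    # ≈ 0.07
```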

In any case, this build is with ClimaCore 0.14.12 and Atmos 0.27.5, 30 elements, runs to completion, and has SYPD of 1.05.

charleskawczynski commented 2 months ago

> Even if the simulation was later canceled or failed, that's still evidence of a slowdown. (That's a ~7% difference, which is consistent with what we see in SYPD.)

Yes, but those measurements are not precise. For example, comparing the last two links at the first step shows the opposite conclusion:

- ClimaCore 0.14.12 with ClimaAtmos 0.27.4: sypd = 0.272
- ClimaCore 0.14.13 with ClimaAtmos 0.27.5: sypd = 0.369

> In any case, this build is with ClimaCore 0.14.12 and Atmos 0.27.5, 30 elements, runs to completion, and has SYPD of 1.05.

Thank you for adding this build; this looks like a good and fair comparison. My best guess is that the launch configuration is not ideal with 4 GPUs, since the GPUs are probably not fed enough data. This aligns with what we observed at lower resolution. It's unfortunate. Perhaps we can try https://github.com/maleadt/StaticCartesian.jl so that we can revert to using a linear launch configuration. There is a chance that the issue is with DSS, but I'd be surprised if that were the case, since (IIRC) that change didn't touch the launch configuration.
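
For context, a minimal sketch of what a "linear" (1D, occupancy-based) launch configuration looks like in CUDA.jl; this is purely illustrative and not ClimaCore's actual kernel code:

```julia
using CUDA

# Toy kernel: every element is addressed through a single linear thread index.
function linear_copy!(dst, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dst)
        @inbounds dst[i] = src[i]
    end
    return nothing
end

dst = CUDA.zeros(Float32, 1_000_000)
src = CUDA.rand(Float32, 1_000_000)

# Compile without launching, then pick blocks/threads from the occupancy API.
kernel  = @cuda launch=false linear_copy!(dst, src)
config  = launch_configuration(kernel.fun)
threads = min(length(dst), config.threads)
blocks  = cld(length(dst), threads)
kernel(dst, src; threads, blocks)
```

A cartesian launch would instead compute a multi-dimensional index rather than the single `i` above.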

Sbozzolo commented 2 months ago

> > Even if the simulation was later canceled or failed, that's still evidence of a slowdown. (That's a ~7% difference, which is consistent with what we see in SYPD.)
>
> Yes, but those measurements are not precise. For example, comparing the last two links at the first step shows the opposite conclusion:
>
> - ClimaCore 0.14.12 with ClimaAtmos 0.27.4: sypd = 0.272
> - ClimaCore 0.14.13 with ClimaAtmos 0.27.5: sypd = 0.369

Yes, of course: the first few steps are not significant and not representative (we are not even going through the callbacks). But 13 weeks into the simulation, the statistical noise is greatly reduced, and I think the walltime at that point can be trusted as a predictor of which build is faster.

> > In any case, this build is with ClimaCore 0.14.12 and Atmos 0.27.5, 30 elements, runs to completion, and has SYPD of 1.05.
>
> Thank you for adding this build; this looks like a good and fair comparison. My best guess is that the launch configuration is not ideal with 4 GPUs, since the GPUs are probably not fed enough data. This aligns with what we observed at lower resolution. It's unfortunate. Perhaps we can try https://github.com/maleadt/StaticCartesian.jl so that we can revert to using a linear launch configuration. There is a chance that the issue is with DSS, but I'd be surprised if that were the case, since (IIRC) that change didn't touch the launch configuration.

Some of the configurations people care about got slower, and it would be good to determine the root cause and understand the tradeoffs. If the issue is with the benchmarks, let's fix them so that they reflect what we care about.

Our science requirement is to be able to run a multi-year AMIP simulation with a reasonably fast turnaround time.
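
To put the turnaround requirement in concrete terms (the 10-year length below is just an example, not a stated requirement):

```julia
# Wall-clock days needed for an n-year simulation at a given SYPD.
walldays(n_years, sypd) = n_years / sypd

walldays(10, 0.88)   # ≈ 11.4 days at the slower coupled SYPD reported above
walldays(10, 1.06)   # ≈ 9.4 days at the faster one
```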

charleskawczynski commented 1 month ago

Actually, I posted the wrong builds earlier; those are not the ones I looked at before merging (cc @szy21). The correct before and after builds are these two:

Diagnostic EDMF went from 977.884 ms to 827.044 ms per step at our target resolution with 1 GPU. Moist Held-Suarez on 4 GPUs didn't run in the older build, but if we compare with https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/336#0191be1d-8c48-4643-8949-5fa0a0c72df0, it also improved. So I'm now wondering whether this issue is coupler-specific.
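
For reference, this is roughly how a per-step wall time maps to the SYPD numbers used elsewhere in this thread (the timestep below is a hypothetical example value, and callback/I/O overhead is ignored):

```julia
# SYPD from wall time per step and model timestep (both in seconds).
sypd(wall_per_step, dt) = dt / (wall_per_step * 365.25)

dt = 120.0              # hypothetical timestep, not the actual one for these runs
sypd(0.977884, dt)      # ≈ 0.34 with the "before" per-step time
sypd(0.827044, dt)      # ≈ 0.40 with the "after" per-step time
0.827044 / 0.977884     # ≈ 0.85, i.e. roughly a 15% faster step
```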

charleskawczynski commented 1 month ago

There is still some sort of regression per the mentioned build, in that the coupler is now slower. It could be that kernels that are not exercised in ClimaCore/ClimaAtmos slowed down.

Sbozzolo commented 1 month ago

Another thought I had (though I hope this won't be the case) is that the atmos GPU target pipelines run under Nsight. We already know that Nsight adds some overhead (that's why we run with multiple threads), and that Nsight has some problems with our jobs, as seen from the jobs that fail while waiting for some response. Nsight has also changed version between builds 314 and 339 (from 2024.2.1 to 2024.4.1, and we are now at 2024.5.1), and I wonder whether different Nsight versions/different launch configurations lead to different overhead. I hope, and don't think, that's the case, because the dry baroclinic wave in our longrun and in the target GPU run has the same wall_time_per_timestep, but maybe this is different for EDMF/MPI.

In any case, I think it would be good to fix all the Nsight problems on clima and get a rough assessment of its impact on measurements.

More generally, we don't compare released versions of our packages in the target atmos GPU pipeline, so we don't have a clean comparison between ClimaCore 0.14.12 and ClimaCore 0.14.13 with ClimaAtmos 0.27.5 (the versions used by the coupler).

Finally, there's a minor launch configuration difference between the atmos jobs in the coupler (which don't use ClimaCoupler but still see the slowdown) and those in atmos: the benchmark jobs in the coupler do not bind tasks to threads.

      - label: "GPU ClimaAtmos with diagnostic EDMF"
        key: "gpu_climaatmos_diagedmf"
        command: "srun julia --threads=3 --color=yes --project=test/ test/component_model_tests/climaatmos_standalone/atmos_driver.jl --config_file $BENCHMARK_CONFIG_PATH/climaatmos_diagedmf.yml --job_id gpu_climaatmos_diagedmf"
        artifact_paths: "experiments/ClimaEarth/output/climaatmos/gpu_climaatmos_diagedmf_artifacts/*"
        env:
          CLIMACOMMS_CONTEXT: "MPI"
          CLIMACOMMS_DEVICE: "CUDA"
        agents:
          slurm_gpus_per_task: 1
          slurm_cpus_per_task: 4
          slurm_ntasks: 4
          slurm_mem: 16GB

(`--cpu-bind=threads` is missing from the `srun` command above.)

This hasn't changed, and hopefully the performance improvements due to ClimaCore do not depend on this configuration.

szy21 commented 1 month ago

> There is still some sort of regression per the mentioned build, in that the coupler is now slower. It could be that kernels that are not exercised in ClimaCore/ClimaAtmos slowed down.

The slowdown is not only in the coupler; it also shows up in the ClimaAtmos longruns:

- ClimaAtmos with old ClimaCore: https://buildkite.com/clima/climaatmos-gpulongruns/builds/387#0191ca6d-b452-426a-8a78-94fcfc841711
- ClimaAtmos with new ClimaCore: https://buildkite.com/clima/climaatmos-gpulongruns/builds/389#0191e200-dd62-4fa0-bcb3-559d8e87ddca

There is a ~10% slowdown in the aquaplanet with diagnostic EDMF longrun. This is with h_elem = 16, though.

charleskawczynski commented 1 month ago

Ok, that's helpful. So, it seems like some jobs got faster, others slowed down. Maybe the change in the launch configuration resulted in some speedups and some slowdowns.

charleskawczynski commented 2 weeks ago

(strong scaling, due to resolution changes)