simonbyrne commented 2 years ago

Memory allocations are currently cause performance issues as they trigger the garbage collector. This is worse when using MPI, as it can be triggered at different times in different processes, causing the communication to get out of sync. For example, this profile:

You can see that the GC pause (in red) on the top process means that the MPI_Waitall on the bottom process takes longer (as it is waiting on data to be sent from the top process). This effect will get worse at higher core counts, hurting scaling.

Potential solutions

Minimize memory allocations

The memory allocations seem to be caused by a few things:

[x] LU factorization in the linear solver (#648)
[x] Use of Refs inside bycolumn closures. For some reason these aren't being elided. The easiest fix seems to be to use an immutable object as a scalar container for broadcasting (will require some changes to ClimaCore).

Synchronize calls to the garbage collector

[ ] This gets a little bit tricky, but we can disable Julia's GC (GC.enable(false)), and then periodically call GC.gc() manually. We would need to experiment with calling full vs incremental GC (see https://bkamins.github.io/julialang/2021/06/11/vecvec.html)

tapios commented 2 years ago

I suggest you focus on a fully explicit model first, before dealing with the implicit part (linear solver).

charleskawczynski commented 2 years ago

Our allocation table in #836 for each step! looks like:

┌───────────────────────────────────────────────────────────────────────┬─────────────┬───────────────┐
│ <file>:<line number>                                                  │ Allocations │ Allocations % │
│                                                                       │   (bytes)   │    (xᵢ/∑x)    │
├───────────────────────────────────────────────────────────────────────┼─────────────┼───────────────┤
│ ClimaCore/jUYzD/src/Fields/broadcast.jl:82                            │   1853472   │      84       │
│ ClimaAtmos.jl/examples/hybrid/schur_complement_W.jl:248               │   221184    │      10       │
│ ClimaAtmos.jl/examples/hybrid/sphere/baroclinic_wave_utilities.jl:171 │    56000    │       3       │
│ ClimaAtmos.jl/examples/hybrid/staggered_nonhydrostatic_model.jl:780   │    55296    │       3       │
│ SciMLBase/chsnh/src/scimlfunctions.jl:1608                            │    8064     │       0       │
│ OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:43   │    5376     │       0       │
│ OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:69   │    4032     │       0       │
│ ClimaAtmos.jl/examples/hybrid/sphere/baroclinic_wave_utilities.jl:435 │    2720     │       0       │
│ OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:65   │    2688     │       0       │
│ ClimaCore/jUYzD/src/Fields/fieldvector.jl:228                         │    1344     │       0       │
│ ClimaAtmos.jl/examples/hybrid/sphere/baroclinic_wave_utilities.jl:121 │     256     │       0       │
└───────────────────────────────────────────────────────────────────────┴─────────────┴───────────────┘

I looked at each of these lines, and here's the summary:

[x] linsolve! needs to be slightly modified to fix allocations due to broadcasting Refs for the identity. I think this is slightly different than other examples and I have an idea of what form is needed to fix this.
[x] vertical∫_col (see #714)
[x] to_scalar_coefs, which uses map, seems to be allocating

[x] SurfaceFluxes call (not sure what's wrong here):

        surface_conditions .=
            constant_T_saturated_surface_conditions.(
                T_sfc,
                Spaces.level(ᶜts, 1),
                Geometry.UVVector.(Spaces.level(Y.c.uₕ, 1)),
                Fields.Field(ᶜz_interior, axes(z_bottom)),
                Fields.Field(z_surface, axes(z_bottom)),
                Cd,
                Ch,
                params,
            )

[ ] Surprisingly, Base.fill!(dest::FieldVector, value) = dest .= value. Maybe return nothing missing?
[x] p_sfc = reshape(parent(p_sfc), (1, p_int_size[2:end]...)) reshape probably incurs allocations

[ ] Upstream allocations:

Lines                                                                allocs (bytes)
SciMLBase/chsnh/src/scimlfunctions.jl:1608                           8064
OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:43  5376
OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:69  4032
OrdinaryDiffEq/QXAKd/src/perform_step/rosenbrock_perform_step.jl:65  2688

The OrdinaryDiffEq lines point to:

The first one may very well be due to the call to fill!, the other two seem to be related to @.. on FieldVectors.

The SciMLBase line points to:

https://github.com/SciML/SciMLBase.jl/blob/e1e58d73714f09046600f3c877dbe4b168935ebd/src/scimlfunctions.jl#L1670

Now that I've looked a bit more closely, there are still a few places where we are allocating fields (in, e.g., held_suarez_tendency!)

charleskawczynski commented 2 years ago

Using the same allocation script, out main branch has the following allocs table:

[ Info: allocs_perf_target_unthreaded: 1 unique allocating sites, 43008 total bytes
┌─────────────────────────────────────────────────────────────────────┬─────────────┬───────────────┐
│ <file>:<line number>                                                │ Allocations │ Allocations % │
│                                                                     │   (bytes)   │    (xᵢ/∑x)    │
├─────────────────────────────────────────────────────────────────────┼─────────────┼───────────────┤
│ ClimaAtmos.jl/examples/hybrid/staggered_nonhydrostatic_model.jl:330 │    43008    │      100      │
└─────────────────────────────────────────────────────────────────────┴─────────────┴───────────────┘

And this line points to an @nvtx macro, which I think is mis-attributed. I've added a PR (#894) that widens the allocation monitoring to more packages (including ClimaTimeSteppers), and this is the updated table:

[ Info: allocs_perf_target_unthreaded: 20 unique allocating sites, 10918647 total bytes (truncated)
┌─────────────────────────────────────────────────────────────────────┬─────────────┬───────────────┐
│ <file>:<line number>                                                │ Allocations │ Allocations % │
│                                                                     │   (bytes)   │    (xᵢ/∑x)    │
├─────────────────────────────────────────────────────────────────────┼─────────────┼───────────────┤
│ ClimaTimeSteppers/y3D2E/src/ClimaTimeSteppers.jl:44                 │   5840151   │      53       │
│ ClimaTimeSteppers/y3D2E/src/solvers/newtons_method.jl:46            │   3014144   │      28       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:147                      │   1510496   │      14       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:398                 │   341440    │       3       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:54                       │    62400    │       1       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:172                 │    28480    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:159                 │    20528    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:257                 │    16128    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/newtons_method.jl:60            │    12096    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:143                 │    11392    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:219                 │    10592    │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:167                 │    9360     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:351                 │    7328     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:227                 │    7280     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:166                 │    6240     │       0       │
│ ClimaAtmos.jl/examples/hybrid/staggered_nonhydrostatic_model.jl:344 │    5376     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:225                 │    4160     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:185                 │    4096     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:211                 │    3504     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:216                 │    3456     │       0       │
└─────────────────────────────────────────────────────────────────────┴─────────────┴───────────────┘

Sometimes (I'm not sure when / how despite using Profile.clear() and Profile.clear_malloc_data()), Coverage picks up allocations of load times, and so we can ignore those line. However, 351 is inside step!, and is the leading allocator over anything in ClimaAtmos.

charleskawczynski commented 2 years ago

This build seems to have properly filtered out the load times:

[ Info: allocs_perf_target_unthreaded: 10 unique allocating sites, 410944 total bytes
┌─────────────────────────────────────────────────────────────────────┬─────────────┬───────────────┐
│ <file>:<line number>                                                │ Allocations │ Allocations % │
│                                                                     │   (bytes)   │    (xᵢ/∑x)    │
├─────────────────────────────────────────────────────────────────────┼─────────────┼───────────────┤
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:398                 │   338464    │      82       │
│ ClimaAtmos.jl/examples/hybrid/staggered_nonhydrostatic_model.jl:330 │    43008    │      10       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:257                 │    16128    │       4       │
│ ClimaTimeSteppers/y3D2E/src/solvers/newtons_method.jl:60            │    12096    │       3       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:103                      │     464     │       0       │
│ ClimaTimeSteppers/y3D2E/src/solvers/imex_ark.jl:384                 │     448     │       0       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:101                      │     144     │       0       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:106                      │     96      │       0       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:102                      │     64      │       0       │
│ ClimaTimeSteppers/y3D2E/src/integrators.jl:138                      │     32      │       0       │
└─────────────────────────────────────────────────────────────────────┴─────────────┴───────────────┘

This points to ClimaTimeSteppers 0.2.4.

charleskawczynski commented 2 years ago

Opened CTS#68

simonbyrne commented 2 years ago

The main remaining item is making the GC deterministic. We had to revert the initial attempt at this in #821, as the heuristic it was using wasn't correct: it seems that the values reported by Sys.free_memory and Sys.total_memory don't take into account the cgroup limits imposed by Slurm.

We can query the current memory cgroup of a process by:

MEM_CGROUP=$(sed -n 's/.*:memory:\(.*\)/\1/p' /proc/$$/cgroup)

($$ expands to the current PID)

and then query limits:

cat /sys/fs/cgroup/memory/$MEM_CGROUP/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/$MEM_CGROUP/memory.usage_in_bytes

The question is how to this works with multiple tasks and MPI launchers.

simonbyrne commented 2 years ago

For reference, the problem is that libuv (which is used by Julia) doesn't yet support cgroups: this however was added recently: https://github.com/libuv/libuv/pull/3744 https://github.com/JuliaLang/julia/pull/46796

charleskawczynski commented 1 year ago

@simonbyrne, is this still an issue? Also, I recently noticed that the solve function first calls step, then gc, then solve. I assume that this is to gc allocations made during compiling methods? If so, one thing I realized is that this doesn’t (necessarily) call the callbacks. We could (and perhaps should) add a trigger_callbacks function and add this call before gc-ing

CliMA / ClimaAtmos.jl

Memory allocations #686

Potential solutions

Minimize memory allocations

Synchronize calls to the garbage collector