JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
44.94k stars 5.42k forks source link

Synchronizing GC calls in distributed computing #53277

Open simonbyrne opened 1 year ago

simonbyrne commented 1 year ago

We're running large distributed jobs with tightly synchronized communication (halo exchanges for PDE solvers), and GC pauses cause catastrophic failures in scaling: 1 process calls the GC, which causes all neighboring processes to wait, which their neighbors and so on. A 600ms GC pause in what is a 10ms synchronization cycle quickly adds a lot of overhead. This gets worse at higher process counts, as the probablility of any 1 process invoking the GC will increase. We're using MPI.jl, but the same would apply with Distributed.jl.

At the moment our only real solution is to (a) minimize the number of allocations which occur, and (b) manually trigger the GC on all processes at the same time, which we do this by simply calling the GC at fixed intervals. This means that lots of manual tuning is required to figure out an appropriate GC interval.

Would it be possible to have some mechanism to query the memory pressure heuristics? In that way, if we know that the GC is likely to be invoked soon on a single process, we could then manually trigger it on all processes at the same time.

I'm also open to other suggestions.

eschnett commented 1 year ago

This sounds reasonable. I'd go one step further and suggest a GC hook that would allow triggering GCs on all nodes in a reliable manner, not just as a heuristic.

In a true distributed systems (probably not via MPI.jl, but with higher level interfaces) there will also be references to remote objects. A local GC can never determine whether these can be freed: Such objects are only freed when the finalizers on the remote references run, triggering a message that will reduce the object's reference count. It might also take several local GC cycles to actually reclaim such objects. In the long term it would be nice if the GC heuristics would take this scenario into account as well, probably via global statistics such as "sum of all heap bytes allocated on all nodes since the last global GC".

vchuravy commented 4 months ago

This is not specifically about Distributed.jl the use case comes from MPI.jl and needs improvements in Base.