CDUGKS hitting OOM at scale on Piz Daint

I'm tracking down an OOM for CDUGKS on Piz Daint. The symptoms are as follows:

- We hit OOM at 1024 nodes with a 1× overdecomposition factor (i.e., no overdecomposition).
- We hit OOM at 512 nodes with a 2× overdecomposition factor. This is weak scaling, so the problem size per node is constant, meaning the total problem size (across the whole machine) is half that of the previous bullet.
- We hit OOM at 2 nodes with a 64× overdecomposition factor.
- All of these runs use -ll:csize 10G. Piz Daint nodes have 64 GB of total memory, so the runtime must be using roughly 54 GB on top of that allocation in order to hit OOM.
- The OOM occurs during the main time step loop of the application, not during initialization.
- I already resolved one leak in the application (for partitioning data) and am not aware of any remaining leaks.
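
For reference, the arithmetic behind the bullets above can be written out as a short sketch. The node counts, overdecomposition factors, 64 GB node size, and 10 GB -ll:csize value come from this report; the helper names are made up for illustration, and the relative problem size simply assumes ideal weak scaling (total size proportional to node count).

```python
# Figures taken from the report; helper names are hypothetical.
NODE_MEMORY_GB = 64   # total memory on a Piz Daint node
CSIZE_GB = 10         # value passed to -ll:csize

# (nodes, overdecomposition factor) at which OOM was observed
oom_configs = [(1024, 1), (512, 2), (2, 64)]


def headroom_gb(node_memory_gb, csize_gb):
    """Memory left on a node once the -ll:csize allocation is subtracted."""
    return node_memory_gb - csize_gb


def relative_problem_size(nodes, reference_nodes):
    """Assuming ideal weak scaling, total problem size scales with node count."""
    return nodes / reference_nodes


print(f"headroom per node: {headroom_gb(NODE_MEMORY_GB, CSIZE_GB)} GB")
reference = oom_configs[0][0]
for nodes, factor in oom_configs:
    rel = relative_problem_size(nodes, reference)
    print(f"{nodes:>5} nodes, {factor:>2}x overdecomposition: "
          f"{rel:.4f} of the total problem size at {reference} nodes")
```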

I can partially reproduce this behavior on Sapling, with the caveat that Sapling nodes have much more memory and therefore do not actually hit OOM. Watching htop as the application runs, initialization takes a couple of minutes, with overall memory usage per node hovering around 20 GB. When the main time step loop begins, memory usage grows dramatically, at a rate of about 2 GB/s. This continues until total memory usage per node reaches about 200 GB. Memory usage then drops sharply to about 50 GB, perhaps because some memory used for tracing has been reclaimed. Usage at this level would easily be enough to hit OOM on a Piz Daint node, which has only 64 GB.
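
To get numbers that are easier to share than htop readings, something like the following sketch could log node memory usage over time. It is not part of CDUGKS or Legion: the function names are hypothetical, it assumes a Linux node where /proc/meminfo reports MemTotal and MemAvailable in kB, and it measures whole-node usage rather than the application alone.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo contents into a dict of field name -> value (kB)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[name.strip()] = int(fields[0])
    return info


def used_gb(info):
    """Used memory in GiB, computed as MemTotal - MemAvailable."""
    return (info["MemTotal"] - info["MemAvailable"]) / 1024**2


def growth_rate(samples):
    """Average growth in GB/s between the first and last (time, used) sample."""
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    return (m1 - m0) / (t1 - t0)


def seconds_until(limit_gb, current_gb, rate_gb_per_s):
    """Time until usage reaches limit_gb at a constant rate; None if not growing."""
    if rate_gb_per_s <= 0:
        return None
    return max(0.0, (limit_gb - current_gb) / rate_gb_per_s)


def sample(count, interval_s=1.0, path="/proc/meminfo"):
    """Take `count` samples, printing a timestamp and used memory for each."""
    samples = []
    for _ in range(count):
        with open(path) as f:
            samples.append((time.time(), used_gb(parse_meminfo(f.read()))))
        print(f"{samples[-1][0]:.1f}\t{samples[-1][1]:.2f} GB", flush=True)
        time.sleep(interval_s)
    return samples
```

Calling `sample(300)` on each node while the application runs would log five minutes of usage at one-second intervals. Plugging in the figures above, `seconds_until(64, 20, 2)` gives 22 seconds, so if the Sapling baseline and growth rate carried over unchanged (an assumption), a 64 GB node would be exhausted well under a minute into the main loop.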

What would be the best next step? I can provide a reproducer on Sapling if that would be useful.
