StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

CDUGKS hitting OOM at scale on Piz Daint #1391

Open elliottslaughter opened 1 year ago

elliottslaughter commented 1 year ago

I'm tracking down an OOM for CDUGKS on Piz Daint. The symptoms are as follows:

I can sort of replicate this behavior on Sapling, with the caveat that Sapling nodes have much more memory and therefore do not actually hit OOM. Watching in htop as the application runs, it takes a couple of minutes to get through initialization, with memory usage per node hovering at around 20 GB. When the main time step loop begins, memory usage grows dramatically, at a rate of about 2 GB/s, until total usage per node reaches about 200 GB. It then drops sharply to about 50 GB, perhaps because some memory used for tracing has been reclaimed. Usage at that peak would be enough to hit OOM on Piz Daint, where the nodes have much less memory.

What would be the next best step? I can provide a reproducer on Sapling if that would be useful.

lightsighter commented 1 year ago

If it is growing at 2 GB/s for tens of seconds, then sampling backtraces of calls to malloc should let you rapidly identify the source of the memory inflation. Set a breakpoint on malloc once you see the growth start. Sample every 100th or every 1000th call and just see what is happening. I bet you notice a pattern pretty quickly.
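
A minimal sketch of that sampling approach, written against gdb's Python API, might look something like the following. It is an illustration, not something that has been run against CDUGKS. The file name, the `malloc-report` command name, and the `SAMPLE_EVERY` and `MAX_FRAMES` values are all assumptions chosen for the example. It also assumes you attach gdb to a single rank once the growth has started, and that a breakpoint on malloc with a Python callback will slow the process down considerably.

```python
# sample_malloc.py (hypothetical name). Load inside gdb with:
#   (gdb) source sample_malloc.py
#   (gdb) continue
#   ... interrupt with Ctrl-C after a while ...
#   (gdb) malloc-report
from collections import Counter

try:
    import gdb  # only importable inside gdb's embedded Python
except ImportError:
    gdb = None

SAMPLE_EVERY = 1000  # assumed sampling interval: every 1000th malloc call
MAX_FRAMES = 12      # assumed number of stack frames kept per sample


class Sampler:
    """Counts calls and tallies a key for every Nth one."""

    def __init__(self, every):
        self.every = every
        self.calls = 0
        self.hits = Counter()

    def tick(self):
        """Return True when the current call should be sampled."""
        self.calls += 1
        return self.calls % self.every == 0

    def record(self, key):
        self.hits[key] += 1

    def report(self, top=5):
        """Return the most frequently sampled keys with their counts."""
        return self.hits.most_common(top)


sampler = Sampler(SAMPLE_EVERY)

if gdb is not None:

    def current_stack(limit):
        """Collect function names from the innermost frame outward."""
        names = []
        frame = gdb.newest_frame()
        while frame is not None and len(names) < limit:
            names.append(frame.name() or "??")
            frame = frame.older()
        return tuple(names)

    class MallocSampler(gdb.Breakpoint):
        def stop(self):
            if sampler.tick():
                sampler.record(current_stack(MAX_FRAMES))
            return False  # never halt here; let the program keep running

    class MallocReport(gdb.Command):
        """Print the most common sampled malloc call stacks."""

        def __init__(self):
            super().__init__("malloc-report", gdb.COMMAND_USER)

        def invoke(self, arg, from_tty):
            for stack, count in sampler.report():
                print("%d samples:" % count)
                for name in stack:
                    print("    " + name)

    MallocSampler("malloc")
    MallocReport()
```

Grouping samples by function names rather than by raw backtrace text keeps identical call paths together even when argument values and addresses differ, so a dominant allocation site should show up near the top of the report.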