Open elliottslaughter opened 1 year ago
If it is growing at 2 GB/s for tens of seconds, then sampling backtraces of calls to malloc should let you rapidly identify the source of the memory inflation. Set a breakpoint on malloc once you see the growth start. Sample every 100th or every 1000th call and just see what is happening. I bet you'll notice a pattern pretty quickly.
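For what it's worth, here is a rough sketch of how that sampling could be scripted with gdb's Python API instead of doing it by hand. The file name `sample_malloc.py`, the `malloc-report` command, and the `StackSampler` helper are all made up for illustration; only `gdb.Breakpoint`, `gdb.Command`, `gdb.newest_frame()`, and `gdb.Frame.older()`/`name()` come from gdb itself. A breakpoint on malloc slows the process down a lot, so treat this as a diagnostic aid, not a profiler.

```python
from collections import Counter

try:
    import gdb  # only importable when the script is sourced from inside gdb
except ImportError:
    gdb = None


class StackSampler:
    """Counts calls and keeps a histogram of every `every`-th call stack."""

    def __init__(self, every=1000):
        self.every = every
        self.calls = 0
        self.stacks = Counter()

    def tick(self):
        """Register one call; return True when this call should be sampled."""
        self.calls += 1
        return self.calls % self.every == 0

    def record(self, stack):
        """Add one sampled stack (a sequence of function names) to the histogram."""
        self.stacks[tuple(stack)] += 1

    def report(self, top=10):
        """Most frequently sampled stacks, most common first."""
        return self.stacks.most_common(top)


def current_stack(depth=8):
    """Names of malloc's callers, innermost first (only valid inside gdb)."""
    names = []
    frame = gdb.newest_frame().older()  # skip the malloc frame itself
    while frame is not None and len(names) < depth:
        names.append(frame.name() or "??")
        frame = frame.older()
    return names


if gdb is not None:
    sampler = StackSampler(every=1000)  # sample every 1000th call

    class MallocSampler(gdb.Breakpoint):
        def stop(self):
            if sampler.tick():
                sampler.record(current_stack())
            return False  # never halt the program, just observe

    class MallocReport(gdb.Command):
        """Hypothetical user command that prints the sampled stack histogram."""

        def __init__(self):
            super().__init__("malloc-report", gdb.COMMAND_USER)

        def invoke(self, arg, from_tty):
            for stack, count in sampler.report():
                print(count, " <- ".join(stack))

    MallocSampler("malloc")
    MallocReport()
```

Assuming that workflow, you would attach with `gdb -p <pid>` once the growth starts, run `source sample_malloc.py`, `continue` for a few seconds, interrupt with Ctrl-C, and then run `malloc-report`. If one call chain dominates the histogram, that is probably the pattern to look at first.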
I'm tracking down an OOM for CDUGKS on Piz Daint. The symptoms are as follows:
`-ll:csize 10G`. Piz Daint nodes have 64 GB of total memory, so the runtime must use roughly 54 GB in order to hit OOM.

I can sort of replicate this behavior on Sapling, with the caveat that Sapling nodes have much more memory and therefore do not actually hit OOM. Watching in `htop` as the application runs, it takes a couple of minutes to get through initialization, with overall memory usage per node hovering at around 20 GB. When the main time step loop begins, I see memory usage grow dramatically, at a rate of about 2 GB/s. This continues until total memory usage per node is about 200 GB. Then I see a dramatic reduction in memory usage to about 50 GB, perhaps because some memory used for tracing has been reclaimed. The memory usage I see would be sufficient to hit OOM on Piz Daint if the nodes were of comparable size.

What would be the next best step? I can provide a reproducer on Sapling if that would be useful.
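To make the numbers above concrete, here is a back-of-the-envelope sketch of the headroom arithmetic. It assumes the growth is linear, that the ~20 GB baseline seen on Sapling carries over to Piz Daint, and that the baseline does not already include the `-ll:csize` allocation. None of these assumptions has been verified, and the function names are only illustrative.

```python
def headroom_gb(node_total_gb, csize_gb):
    """Memory left on a node once the -ll:csize allocation is set aside."""
    return node_total_gb - csize_gb


def seconds_to_reach(limit_gb, baseline_gb, growth_gb_per_s):
    """Time for linear growth from baseline_gb to reach limit_gb."""
    return max(0.0, (limit_gb - baseline_gb) / growth_gb_per_s)


# Piz Daint: 64 GB per node, 10 GB reserved via -ll:csize.
daint_headroom = headroom_gb(64, 10)

# Assumed: 20 GB baseline after initialization, 2 GB/s growth in the main loop.
daint_time = seconds_to_reach(daint_headroom, 20, 2)

# Sapling: growth observed from about 20 GB up to about 200 GB.
sapling_time = seconds_to_reach(200, 20, 2)

print(f"Piz Daint headroom: {daint_headroom} GB")
print(f"Estimated time to exhaust it: {daint_time:.0f} s")
print(f"Estimated growth phase on Sapling: {sapling_time:.0f} s")
```

Under those assumptions, Piz Daint would run out of headroom within a few tens of seconds of entering the time step loop, well before the point at which Sapling appears to reclaim memory.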