syamajala opened 1 year ago
Please check that this is not a duplicate of #1513
I am on an older commit of legion from Jul 10 that does not have the caching allocator. I see that was merged on Jul 14th.
Other than the start-up allocations, these are all variations on the same thing: Legion waits on an event, Realm blocks, and then Realm spawns a new thread with a new stack to start running other tasks on the processor. It looks like you literally have hundreds (if not thousands) of live tasks. Most of the events Legion is waiting on look like things that are waiting on network communication to be done, suggesting the network is bogged down (as usual on Frontier).
Well, our only other option is to move to one rank/node, but that's been blocked on #1309 and the control_replication merge/new equivalence set heuristics. I'm guessing there's no update on any of that as far as when it might be done?
For the allocation we received on Frontier we need to run at least some production case at 2048 nodes, if not above that, so we need to figure out a plan for how to do that.
I will start trying to scale the production case and see where it falls over, some cases in the past did run at 2048 nodes.
If the problem is the same as #1309 then I'm optimistic that you'll be able to start trying the shardrefine branch later this week.
So we still have this issue when running one rank/gpu on Frontier. At 2048 nodes we OOM at startup, but it's fine for 1-1024 nodes.
I suppose that means this is something different from #1309 then.
What part of start-up do you have problems with? Do you even make it to the first time step?
Do you have high-water marks for the smaller node counts to see if their memory usage is growing proportional to the scale of the machine?
I tried reducing csize from 32 GB/rank to 8 GB/rank, and we seem to at least make it through startup and finish the first timestep. It looks like there's about 200 GB of memory still available too, but it's been about 7 minutes and it hasn't made it through the second timestep yet...
I don't know if I will try to debug this with the mem_trace tools.
> I tried reducing csize from 32 GB/rank to 8 GB/rank, and we seem to at least make it through startup and finish the first timestep. It looks like there's about 200 GB of memory still available too, but it's been about 7 minutes and it hasn't made it through the second timestep yet...
Is this still OOM? I would have thought 200 GB would have been enough to make forward progress.
This is still an issue when you start to run multiple ranks per node. We cannot scale 4 ranks/node past 2048 nodes. I have gotten S3D to the point where we create almost no instances in sys mem and only create instances in fb mem, which should leave around ~400 GB free for the runtime, but we still OOM at 4096 nodes.
I'm looking at memory usage at start up while scaling S3D on Frontier. We are ok up to 1024 nodes. At 2048 nodes the main task starts but we hit OOM before we finish executing the first time step.
One thing that has helped reduce memory usage is changing the RemoteEventTableAllocator tree parameters here: https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/realm/runtime_impl.h?ref_type=heads#L152 from 10, 7 to 11, 5.
We need to see if we can reduce memory usage more.
Here is a histogram of the top 20 mallocs: http://sapling2.stanford.edu/~seshu/s3d_oom/total_top10_vs_size.html
There are stack traces for the 5 biggest allocations here: http://sapling2.stanford.edu/~seshu/s3d_oom/backtrace_frontier00379_53844.txt