StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Realm: S3D OOM at start up on Frontier #1515

Open syamajala opened 1 year ago

syamajala commented 1 year ago

I'm looking at memory usage at startup while scaling S3D on Frontier. We are OK up to 1024 nodes; at 2048 nodes the main task starts, but we hit an OOM before finishing the first time step.

One thing that has helped reduce memory usage is changing the RemoteEventTableAllocator tree parameters here: https://gitlab.com/StanfordLegion/legion/-/blob/control_replication/runtime/realm/runtime_impl.h?ref_type=heads#L152 from `10, 7` to `11, 5`.
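A rough way to see why this helps: if the two parameters are interpreted as log2 chunk sizes for the inner nodes and leaves of the dynamic table (an assumption about Realm's DynamicTableAllocator, not verified against the source), then shrinking the leaves reduces the memory committed for each sparsely-touched leaf. A minimal sketch:

```python
# Hedged sketch: estimate memory committed by populated leaves of a
# two-level dynamic table. Assumptions (not taken from Realm's source):
# the second parameter is log2(entries per leaf), and each remote-event
# entry costs ~64 bytes.
ENTRY_BYTES = 64  # assumed per-entry footprint

def leaf_bytes(leaf_bits, touched_leaves):
    """Bytes committed once `touched_leaves` leaves have been allocated."""
    return touched_leaves * (1 << leaf_bits) * ENTRY_BYTES

old = leaf_bytes(7, touched_leaves=10_000)  # original 10, 7 configuration
new = leaf_bytes(5, touched_leaves=10_000)  # tuned 11, 5 configuration
print(old // new)  # 4: each sparsely-used leaf commits 4x less memory
```

Under these assumptions, events that land sparsely across the index space waste 4x less per-leaf slack with the smaller leaf size, at the cost of a deeper/wider inner level.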

We need to see if we can reduce memory usage more.

Here is a histogram of the top 20 mallocs: http://sapling2.stanford.edu/~seshu/s3d_oom/total_top10_vs_size.html

There are stack traces for the 5 biggest allocations here: http://sapling2.stanford.edu/~seshu/s3d_oom/backtrace_frontier00379_53844.txt

lightsighter commented 1 year ago

Please check that this is not a duplicate of #1513

syamajala commented 1 year ago

I am on an older commit of Legion from July 10 that does not have the caching allocator; I see that was merged on July 14th.

lightsighter commented 1 year ago

Other than the start-up allocations, these are all variations on the same thing: Legion waits on an event, Realm blocks, and then Realm spawns a new thread with a new stack to start running other tasks on the processor. It looks like you literally have hundreds (if not thousands) of live tasks. Most of the events Legion is waiting on look like things that are waiting on network communication to be done, suggesting the network is bogged down (as usual on Frontier).
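If that's the mechanism, stack memory alone scales with the number of simultaneously blocked tasks. A back-of-envelope sketch (the 2 MB per-stack figure is an assumption; Realm's actual default stack size may differ and is configurable):

```python
# Hedged estimate: every task blocked on an event keeps a worker-thread
# stack alive. The stack size here is assumed, not read from Realm's
# configuration.
STACK_MB = 2  # assumed per-thread stack size

def blocked_stack_mb(live_blocked_tasks):
    """Approximate resident stack memory for blocked tasks, in MB."""
    return live_blocked_tasks * STACK_MB

print(blocked_stack_mb(100), blocked_stack_mb(1000))  # 200 2000
```

So "hundreds if not thousands" of live blocked tasks could plausibly account for gigabytes of stack memory per node on top of the runtime's other allocations.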

syamajala commented 1 year ago

Well, our only other option is to move to rank/node, but that's been blocked on #1309 and the control_replication merge/new equivalence set heuristics. I'm guessing there's no update on any of that as far as when it might be done?

For the allocation we received on Frontier, we need to run at least some production case at 2048 nodes, if not above that, so we need to figure out a plan for how to do that.

I will start trying to scale the production case and see where it falls over; some cases in the past did run at 2048 nodes.

lightsighter commented 1 year ago

If the problem is the same as #1309 then I'm optimistic that you'll be able to start trying the shardrefine branch later this week.

syamajala commented 7 months ago

So we still have this issue when running rank/gpu on Frontier. At 2048 nodes we OOM at startup, but it's fine for 1 to 1024 nodes.

lightsighter commented 7 months ago

I suppose that means this is something different from #1309 then.

What part of start-up do you have problems with? Do you even make it to the first time step?

Do you have high-water marks for the smaller node counts, to see if memory usage is growing in proportion to the scale of the machine?

syamajala commented 7 months ago

I tried reducing csize from 32 GB/rank to 8 GB/rank, and we seem to at least make it through startup and finish the first time step. It looks like there's about 200 GB of memory still available too, but it's been about 7 minutes and it hasn't made it through the second time step yet...
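For reference, Realm's system-memory pool is sized with the `-ll:csize` flag, which takes megabytes; the reduction above corresponds to a launch line along these lines (the binary name and launcher arguments are placeholders, not the actual S3D invocation):

```shell
# Hedged example only; s3d.x and the srun arguments are placeholders.
# -ll:csize is in MB, so 8 GB/rank is 8192 (was 32768 for 32 GB/rank).
srun -N 2048 --ntasks-per-node=1 ./s3d.x -ll:csize 8192
```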

syamajala commented 7 months ago

I don't know if I will try to debug this with the mem_trace tools.

elliottslaughter commented 7 months ago

> I tried reducing csize from 32 GB/rank to 8 GB/rank, and we seem to at least make it through startup and finish the first time step. It looks like there's about 200 GB of memory still available too, but it's been about 7 minutes and it hasn't made it through the second time step yet...

Is this still OOM? I would have thought 200 GB would have been enough to make forward progress.

syamajala commented 2 months ago

This is still an issue when you start to run multiple ranks/node. We cannot scale 4 ranks/node past 2048 nodes. I have gotten S3D to the point where we create almost no instances in sysmem and only create instances in framebuffer memory, which should leave roughly 400 GB free for the runtime, but we still OOM at 4096 nodes.