StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Legion: shardrefine slow startup on Summit #1594

Open syamajala opened 11 months ago

syamajala commented 11 months ago

I am seeing slow startup on Summit. At 4 nodes, the first timestep never completes, even after 10 minutes. Based on the stack traces, it didn't look like it was hanging though?

I took stacktraces over time here: http://sapling2.stanford.edu/~seshu/s3d_subrank_summit/

lightsighter commented 10 months ago

This looks like a hang. The backtraces are not changing between startup2 and startup3. It looks like a networking hang to me though. We're definitely waiting on messages for some of the collective operations inside of Legion and they don't appear to be making it there. We would need the collective IDs of some of the collective objects in the different backtraces to make sure they line up, but given that they are lining up on other machines for S3D, there's no reason they shouldn't be lining up here as well. Which version of GASNetEx are you running with in these examples?

syamajala commented 10 months ago

It is 2023.3.0. I also tried the -gex:immediate flag, because I vaguely remember needing that in the past on Summit, but it did not seem to help.

lightsighter commented 10 months ago

Try building with -DDEBUG_LEGION_COLLECTIVES and see if you get any errors before the run hangs.

lightsighter commented 10 months ago

The really suspicious thing, though, is that some stack traces are waiting on messages that aren't part of collectives. These are messages that can make independent forward progress for things like trace recording, but the stack traces aren't changing.

lightsighter commented 10 months ago

I looked at these backtraces again, and they definitely resemble a network hang. The messages for the trace recording run without preconditions and they're not making it to the other side.

lightsighter commented 10 months ago

You commented in #1553 that you thought shardrefine is good enough to merge. Do you still think that given you said this issue was still a problem with shardrefine in #1309?