Open syamajala opened 11 months ago
This looks like a hang. The backtraces are not changing between startup2 and startup3. It looks like a networking hang to me though. We're definitely waiting on messages for some of the collective operations inside of Legion and they don't appear to be making it there. We would need the collective IDs of some of the collective objects in the different backtraces to make sure they line up, but given that they are lining up on other machines for S3D, there's no reason they shouldn't be lining up here as well. Which version of GASNetEx are you running with in these examples?
It is 2023.3.0. I also tried the -gex:immediate flag, because I vaguely remember needing that in the past on Summit, but it did not seem to help.
Try building with -DDEBUG_LEGION_COLLECTIVES and see if you get any errors before the run hangs.
The really suspicious thing is that there are some stack traces waiting on messages that aren't part of collectives though, messages that can make independent forward progress for things like trace recording, but they aren't changing.
I looked at these backtraces again, and they definitely resemble a network hang. The messages for the trace recording run without preconditions, and they're not making it to the other side.
You commented in #1553 that you thought shardrefine is good enough to merge. Do you still think that given you said this issue was still a problem with shardrefine in #1309?
I am seeing slow startup on Summit. At 4 nodes I never see it complete the first timestep, even after 10 minutes. Based on the stacktraces it didn't look like it was hanging though?
I took stacktraces over time here: http://sapling2.stanford.edu/~seshu/s3d_subrank_summit/
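One way to check whether stacks are actually stuck between two snapshots, rather than eyeballing them, is to mask the parts that change for uninteresting reasons and compare the rest. The following is only a rough sketch, not something from Legion or from the dumps linked above: it assumes gdb-style backtraces where each frame line starts with `#`, and the helper names `normalize` and `unchanged_threads` are hypothetical.

```python
import re

# Hex addresses can differ between snapshots even when a thread has not moved,
# so they are masked before comparing. (Assumption: gdb-style "#N 0x... in fn ()" frames.)
_ADDR = re.compile(r"0x[0-9a-fA-F]+")


def normalize(trace: str) -> tuple:
    """Reduce one backtrace to a tuple of frame lines with addresses masked."""
    frames = []
    for line in trace.splitlines():
        line = line.strip()
        if line.startswith("#"):
            frames.append(_ADDR.sub("0xADDR", line))
    return tuple(frames)


def unchanged_threads(before: dict, after: dict) -> list:
    """Return the thread keys whose normalized stacks are identical in both snapshots.

    `before` and `after` map a caller-chosen thread key (for example
    "node0/thread12") to the raw backtrace text for that thread.
    """
    common = before.keys() & after.keys()
    return sorted(k for k in common if normalize(before[k]) == normalize(after[k]))


if __name__ == "__main__":
    snap1 = {
        "t1": "#0 0x00007f01 in poll ()\n#1 0x0000aa in wait_for_message ()",
        "t2": "#0 0x1 in compute ()",
    }
    snap2 = {
        "t1": "#0 0x00007f99 in poll ()\n#1 0x0000bb in wait_for_message ()",
        "t2": "#0 0x2 in other ()",
    }
    print(unchanged_threads(snap1, snap2))
```

If nearly every thread shows up as unchanged across snapshots taken minutes apart, that points toward a hang; if the set keeps shifting, it is more likely slow progress.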