StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0
657 stars, 146 forks

S3D hang on Frontier #1683

Open syamajala opened 2 months ago

syamajala commented 2 months ago

I am seeing S3D hang at 8 nodes (2 ranks/node) on Frontier after 10 timesteps. It does not look like any threads are making progress. I am running with all of @elliottslaughter's flags.

There are some stack traces here: http://sapling2.stanford.edu/~seshu/s3d_tdb/frontier/stacktraces/

elliottslaughter commented 2 months ago

I've been reviewing this with Seshu. The symptoms are identical to what I was seeing at 8192 nodes on Frontier, but here they appear at a dramatically smaller node count. I don't think I've ever seen a network freeze below 128 nodes, let alone at 8.

The network variables check out and should be correct for the configuration Seshu is running.

The stack traces all appear to be effectively empty, which is consistent with what I was seeing.

The CXI debug logging doesn't print anything meaningful, which is also consistent with what I was seeing.

We checked the NIC binding and it's fine.

I don't know what else to say. These runs seem to be doing all the right things, but they're freezing anyway.