Open syamajala opened 7 months ago
I've been reviewing this with Seshu. The symptoms are identical to what I was seeing at 8192 nodes on Frontier, but here the freeze happens at dramatically smaller node counts. I don't think I've ever seen a network freeze below 128 nodes, let alone 8.
The network variables check out and should be correct for the configuration Seshu is running.
The stack traces all appear to be effectively empty, which is consistent with what I was seeing.
The CXI debug logging doesn't print anything meaningful, which is also consistent with what I was seeing.
We checked the NIC binding and it's fine.
I don't know what else to say. These runs seem to be doing all the right things, but they're freezing anyway.
I am seeing S3D hang at 8 nodes (2 ranks/node) on Frontier after 10 timesteps. It does not look like any threads are making progress. I am running with all of @elliottslaughter's flags.
There are some stack traces here: http://sapling2.stanford.edu/~seshu/s3d_tdb/frontier/stacktraces/