Open sandrain opened 3 years ago
Just another datapoint here: recent runs on Summit with PMIx still start failing to bootstrap at around 512 nodes/servers. It does not fail on every run, but it is still problematic for production use with large-scale jobs.
Adding another datapoint. On Frontier, with 628 nodes and 8 ranks/node, bootstrap seems to time out consistently at the unifyfs_invoke_broadcast_bootstrap_complete() call.
System information
Summit
Describe the problem you're observing
The unifyfsd bootstrapping (i.e., establishing connections between peers) often fails when a large number of compute nodes are used, e.g., 500+ servers. unifyfsd relies on pmi2 or pmix to acquire job allocation information and group services like fence/barrier. However, these options seem unreliable when a job is launched with a large number of nodes (at least on Summit, in my experience). As a last resort, unifyfsd can bootstrap with peers without help from external libraries, by using a shared file system. However, there are also times when the shared file system (like GPFS) behaves badly (e.g., sudden slowdown of operations). I think we need to investigate whether we can make the bootstrapping process more reliable.
Describe how to reproduce the problem
Run unifyfsd with a large number of servers, e.g., 500+.
Include any warnings, errors, or other relevant debugging data
N/A
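To make the shared-file-system fallback mentioned in the problem description concrete, here is a minimal Python sketch of a file-based rendezvous that tolerates a slow file system by polling with exponential backoff and a hard deadline. This is not how unifyfsd actually implements it; the function names (publish_address, wait_for_peers), the `<rank>.addr` file layout, and all timeout values are hypothetical and chosen only for illustration.

```python
import os
import time


def publish_address(rendezvous_dir, rank, address):
    """Publish this server's address as <rendezvous_dir>/<rank>.addr.

    Hypothetical layout. The write goes to a temporary file first and is
    then renamed, so a peer never reads a partially written address.
    """
    final_path = os.path.join(rendezvous_dir, f"{rank}.addr")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(address)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, final_path)


def wait_for_peers(rendezvous_dir, n_servers, timeout_s=60.0,
                   initial_delay_s=0.05, max_delay_s=2.0):
    """Poll until every rank has published, backing off between polls.

    Returns a dict mapping rank -> address. Raises TimeoutError naming the
    ranks that never showed up, which is more useful for debugging than a
    bare timeout.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    addresses = {}
    while True:
        # Only look for ranks we have not seen yet, to limit metadata load.
        for rank in range(n_servers):
            if rank in addresses:
                continue
            path = os.path.join(rendezvous_dir, f"{rank}.addr")
            try:
                with open(path) as f:
                    addresses[rank] = f.read()
            except FileNotFoundError:
                pass
        if len(addresses) == n_servers:
            return addresses
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            missing = [r for r in range(n_servers) if r not in addresses]
            raise TimeoutError(f"missing ranks: {missing}")
        # Sleep for the current backoff delay, but never past the deadline.
        time.sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

In a sketch like this, each server would call publish_address() once and then wait_for_peers() with the total server count. Reporting the missing ranks on timeout might help narrow down whether a failure at 500+ servers comes from a few slow nodes or from the file system as a whole.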