More reliable server bootstrapping at scale

sandrain commented 3 years ago

System information

Summit

Describe the problem you're observing

The unifyfsd bootstrapping (i.e., establishing connections between peers) often fails when a large number of compute nodes are used, e.g., 500+ servers.

unifyfsd relies on PMI2 or PMIx to acquire job allocation information and to provide group services such as fence/barrier. However, both seem unreliable when a job is launched on a large number of nodes (at least on Summit, in my experience). As a last resort, unifyfsd can bootstrap with its peers without help from external libraries by using a shared file system. However, the shared file system (e.g., GPFS) also misbehaves at times (e.g., sudden slowdowns of operations). I think we need to investigate whether we can make the bootstrapping process more reliable.
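To make the shared-file-system fallback concrete, here is a minimal sketch of a file-based rendezvous that tolerates a slow file system by polling with exponential backoff and a hard deadline. This is not the unifyfsd implementation. The function names, the `server.<rank>` file layout, and the timeout values are all hypothetical, chosen only to illustrate the idea.

```python
import os
import time


def publish_address(shared_dir, rank, address):
    """Publish this server's address as <shared_dir>/server.<rank> (hypothetical layout)."""
    final_path = os.path.join(shared_dir, f"server.{rank}")
    tmp_path = f"{final_path}.tmp.{os.getpid()}"
    with open(tmp_path, "w") as f:
        f.write(address)
        f.flush()
        os.fsync(f.fileno())
    # Rename into place so that readers never observe a partially written file.
    os.replace(tmp_path, final_path)


def collect_addresses(shared_dir, num_servers, timeout_s=60.0,
                      initial_delay_s=0.05, max_delay_s=2.0):
    """Poll for every rank's file until all are present or the deadline passes.

    Returns (addresses, missing): a dict of rank -> address, and a sorted
    list of ranks whose files never appeared.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    addresses = {}
    missing = set(range(num_servers))
    while True:
        # Only re-check ranks that have not been seen yet.
        for rank in sorted(missing):
            path = os.path.join(shared_dir, f"server.{rank}")
            try:
                with open(path) as f:
                    addresses[rank] = f.read()
                missing.discard(rank)
            except FileNotFoundError:
                pass  # peer has not published yet; retry on the next pass
        now = time.monotonic()
        if not missing or now >= deadline:
            return addresses, sorted(missing)
        # Back off exponentially, but never sleep past the deadline.
        time.sleep(min(delay, deadline - now))
        delay = min(delay * 2, max_delay_s)
```

Returning the list of missing ranks, rather than a bare timeout, would at least tell us which servers never showed up when a 500+ node launch fails. Whether rename-into-place and polling behave well on GPFS under load is an assumption that would need testing.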

Describe how to reproduce the problem

Run unifyfsd with a large number of servers, e.g., 500+.

Include any warnings, errors, or relevant debugging data

N/A

MichaelBrim commented 3 years ago

Just another datapoint here: recent runs on Summit with PMIx still start failing to bootstrap at around 512 nodes/servers. It does not fail every time, but it is still problematic for production use with large-scale jobs.

wangvsa commented 6 months ago

Adding another datapoint. On Frontier, with 628 nodes and 8 ranks/node, bootstrap seems to time out consistently at the unifyfs_invoke_broadcast_bootstrap_complete() call.

LLNL / UnifyFS