SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint/restart capability for MPI codes.
We want to be sure that scalable restart is supported in a Flux allocation in which a user has allocated a spare node. As an example, the full test process for that would be:
Allocate 3 nodes (the following steps all take place within this 3-node allocation)
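For concreteness, a minimal sketch of this step with the current Flux CLI (older flux-core releases spell this `flux mini alloc`); the job script name is a placeholder for a script that carries out the remaining steps:

```bash
# Allocate 3 nodes and run the test's job script inside that allocation.
# run_scr_test.sh is a hypothetical name for the script described below.
flux alloc -N 3 ./run_scr_test.sh
```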
Run an SCR test_api job on 2 of the 3 nodes. Wait for that job to write at least one checkpoint. Use /dev/shm with the XOR redundancy scheme (both on by default). One can use the --seconds option to have test_api sleep for some time between checkpoints to extend the run.
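A hedged sketch of this launch, assuming the usual SCR environment variables (`SCR_CACHE_BASE`, `SCR_COPY_TYPE`) and the `scr_run <launcher> <launcher args> <program> <args>` wrapper form; exact option spellings may differ across SCR and Flux versions:

```bash
# Cache checkpoints in node-local /dev/shm and protect them with XOR.
# Both are the defaults; they are set here only to make the test explicit.
export SCR_CACHE_BASE=/dev/shm
export SCR_COPY_TYPE=XOR

# Run test_api on 2 of the 3 nodes under scr_run; --seconds makes test_api
# sleep between checkpoints so the run lasts long enough to kill a node.
scr_run flux run -N 2 -n 2 ./test_api --seconds=30
```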
Kill one of the two nodes (power off a node) while the job is running. We know that we can't kill a Flux router node.
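Two hedged ways to take the node down; the hostname and BMC credentials below are placeholders, and the victim must not be a Flux router node (e.g., the node hosting the rank 0 broker):

```bash
VICTIM=node42   # hypothetical: one of the two compute nodes running test_api

# Hard power-off through the node's BMC (address/credentials are site-specific):
ipmitool -H ${VICTIM}-bmc -U admin -P changeme chassis power off

# Or, if a real power-off is not possible, approximate the failure by
# killing the Flux broker on that node:
ssh "$VICTIM" sudo pkill -9 -f flux-broker
```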
Ensure that the current test_api job dies and that control comes back to the job script. This should happen fast, ideally within seconds of killing the node, but it is still acceptable if it takes a couple of minutes.
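One simple way to check this in the job script is to time the launch from the earlier sketch and record its exit status:

```bash
# Wrap the launch so we can see how quickly control returns after the node
# is killed, and with what exit code.
START=$(date +%s)
scr_run flux run -N 2 -n 2 ./test_api --seconds=30
RC=$?
echo "launcher returned rc=$RC after $(( $(date +%s) - START )) seconds"
```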
The scr_run script should then be able to successfully detect the downed node via scr_list_down_nodes.
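A hedged example; a bare invocation is assumed to print the list of down nodes, though the exact options and output format vary by SCR version:

```bash
# Ask SCR which nodes in the allocation it now considers down.
DOWN_NODES=$(scr_list_down_nodes)
echo "SCR reports these nodes as down: $DOWN_NODES"
```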
Let's check if Flux offers an API to query for nodes that it knows to be down. If it has such an API, does it work? If it does not, we should open a Flux ticket to investigate.
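Flux does expose node and broker liveness through its CLI (backed by RPCs); these two commands exist in flux-core and are worth running right after the node is killed to see how quickly the failure shows up:

```bash
# Scheduler-level view: nodes grouped by state (avail, offline, drained, ...).
flux resource status

# Overlay-level view: health of the broker (TBON) tree, which may notice a
# lost broker before the node is marked offline.
flux overlay status
```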
Let's also ensure that we can detect the down node independently of Flux using the other tests run by scr_list_down_nodes. This is important because it can often take many minutes before a resource manager realizes that a node is down. For example, the way SLURM is configured, it checks nodes on a 5-10 minute heartbeat. If we only rely on the resource manager, we might continually attempt to launch new jobs onto dead nodes for 5-10 minutes until the resource manager finally detects and reports the nodes as dead.
Thus, we would like to do both. We ask the resource manager for its list of down nodes, and then we also do our own independent check of the nodes that are thought to still be healthy.
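A sketch of the kind of independent liveness probe this implies, similar in spirit to the checks scr_list_down_nodes performs itself; the node list variable is hypothetical:

```bash
# Probe each node we still believe is healthy with a short, bounded ssh;
# anything that does not answer within 10 seconds is treated as suspect.
SUSPECT=""
for node in $EXPECTED_NODES; do        # hypothetical space-separated node list
    if ! timeout 10 ssh -o BatchMode=yes "$node" true 2>/dev/null; then
        SUSPECT="$SUSPECT $node"
    fi
done
echo "unresponsive nodes:$SUSPECT"
```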
Given a list of dead nodes, we then want the scr_run script to put together a new launch command that only uses healthy nodes. In this case, it should detect the node that was killed and piece together a command to launch a 2-node run using the 2 remaining healthy nodes.
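A hypothetical fragment of that logic: subtract the down nodes from the allocation's node list and size the next run to whatever remains. Whether Flux also needs an explicit host constraint, or simply avoids the offline broker on its own, is part of what this test should establish:

```bash
# ALL_NODES: hypothetical space-separated list of the 3 allocated nodes.
# DOWN_NODES: output of scr_list_down_nodes from the previous step.
HEALTHY=$(comm -23 <(tr ' ' '\n' <<< "$ALL_NODES" | sort) \
                   <(tr ' ' '\n' <<< "$DOWN_NODES" | sort))
NUM_HEALTHY=$(wc -w <<< "$HEALTHY" | tr -d ' ')

# Relaunch on the remaining healthy nodes (2 in this scenario).
scr_run flux run -N "$NUM_HEALTHY" -n "$NUM_HEALTHY" ./test_api --seconds=30
```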
Launch the new job and be sure that it actually starts to run. When testing other resource managers, we have found that the second MPI job sometimes fails to launch and/or hangs after a node failure. We want to be sure that this next run does not hang and does not take too long to get started. In some cases, like with jsrun, we found we needed to sleep for a few minutes to give jsrun time to find and exclude the down node, or otherwise the next jsrun command would hang.
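One hedged way to catch the hang case, shown with a plain `flux submit` for clarity even though the real test launches through scr_run (older flux-core spells it `flux mini submit`); the 300-second bound is an arbitrary budget for this sketch:

```bash
# Submit the next run, then wait (with a bound) for its 'start' event to
# appear in the job eventlog; a timeout here means the launch hung or the
# scheduler could not place the job on the remaining nodes.
JOBID=$(flux submit -N 2 -n 2 ./test_api --seconds=30)
if ! flux job wait-event -t 300 "$JOBID" start; then
    echo "job $JOBID did not start within 300 seconds"
fi
```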
When the next test_api job starts to run, we need to verify that SCR correctly detects, rebuilds, and restarts from the most recent checkpoint, which was saved in /dev/shm and protected with XOR. It should have to rebuild the files that were stored on the node that was killed.
Here is the old issue opened with Flux about tolerating node failures.
https://github.com/flux-framework/flux-core/issues/4417