Running scontrol from within container

itzsimpl commented 7 months ago

In slurm many scripts use signals to get a notification before the time limit is reached. They use them to create a checkpoint and force a requeue of the job in question. One such example is Lightning (https://github.com/Lightning-AI/pytorch-lightning/blob/520c1e4713340f5bbf66de215471c2863e8fbdf2/src/lightning/pytorch/trainer/connectors/signal_connector.py#L86).

However, when running in an enroot container with pyxis, the command scontrol is not available. Any thoughts how this could be resolved? Similar, but not the same as #31; here we'd like just to call the scontrol requeue $SLURM_JOB_ID.

flx42 commented 7 months ago

I've seen multiple approaches for this. It's usually it's handled at the sbatch level, but a requeue is probably not the right approach. A requeue can cause issues with the job log, as the default is often to truncate the file. Slurm will also delay the start time of the requeued job (not sure why exactly).

Anyway, here what is commonly used:

Queue a follow-up job before launching the application: https://slurm.schedmd.com/srun.html#OPT_dependency. At the beginning of the follow-up job, it can decide to exit if a condition is met (i.e. end of training).:
Properly handle signals in the application, and assume you have time to run code in the sbatch script after the main application, then you can decide to submit a new job. If the application needs a lot of time to checkpoint, you can restrict the execution time of the srun to be slightly lower than the time allocated to the job, something like use srun -t $((sbatch_time_limit-5)).

flx42 commented 7 months ago

slurm will also delay the start time of the requeued job (not sure why exactly).

I found that it's done here: https://github.com/SchedMD/slurm/blob/slurm-23-11-0-1/src/slurmctld/job_mgr.c#L16376-L16384 A requeued job will be delayed 121 seconds, when it might have started immediately.

itzsimpl commented 7 months ago

Thanks for the info. I'm aware that there is a delay, although a nuisance I don't find it a big problem, at the moment.

The downside of the approach with queuing a follow-up job before lunching the application, is that it requires execution of multiple commands, or the automation of a single sbatch script that will do so, which can become tricky. In addition it does not really solve the main problem (how to notify a Lightning app that is running in an enroot container to create a checkpoint if this happens to be between two regular checkpoints). I'm aware that one can increase the frequency of checkpointing, what I'm asking really, is, is there is a way to make Lighting that is run in an enroot/pyxis container behave as if it were run on baremetal.

Giving the step a slightly lower than the allocated time will ensure that the sbatch script's signal handler will have some time left. But the app that is run in the srun step will be dead, so no chance of it performing a checkpoint. Note also that when SLURM sends an abort signal it will wait for some time on its own https://github.com/SchedMD/slurm/blob/769da35200d4a2c0f42a6e060b2b180ed95bfc8e/src/api/step_launch.c#L671.

We currently do the following. The main sbatch handles the signal (B:USR1@90), the srun command is run in the background so that the sbatch script will get notified 90s before the job time is exhausted. This requeues the job as desired, but has the downside is that the Lightning based scripts, which run in the enroot container (started via the srun) will not create a checkpoint before being requeued. Even if the sbatch script, on receipt of the signal, sends yet another signal (USR2) to the srun PID (i.e. the Lightning based scripts that run in enroot/pyxis), the script somehow does not seem to do a checkpoint, which it should before calling the requeue. The signal is properly configured (https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html#enable-auto-wall-time-resubmissions) and the Lightning code that handles it is here https://github.com/Lightning-AI/pytorch-lightning/blob/520c1e4713340f5bbf66de215471c2863e8fbdf2/src/lightning/pytorch/trainer/connectors/signal_connector.py#L67-L103.

flx42 commented 7 months ago

You should consider the checkpointing and the follow-up job separately. They are both orthogonal to containers.

Checkpoints: the srun application should be the one to get the signal, and on receiving the signal it will perform the checkpoint, then exit to let the sbatch script finish executing. So do not use B:USR1@90.
Follow-up job: handle that in the sbatch before or after the srun. The main challenge is when the checkpoint takes more time than the configured KillWait on the cluster, then you can either pre-queue the follow-up, use a smaller time limit for the srun, use --signal, etc.

The approach above also works for regular PyTorch. It looks like there is a problem with signal handling in PyTorch Lightning, and the requeue does not work inside containers, but you don't need to rely on any of those.

NVIDIA / pyxis

Running scontrol from within container #129