NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Running scontrol from within container #129

Open itzsimpl opened 7 months ago

itzsimpl commented 7 months ago

In Slurm, many scripts use signals to get a notification before the job's time limit is reached. They use them to create a checkpoint and force a requeue of the job in question. One such example is Lightning (https://github.com/Lightning-AI/pytorch-lightning/blob/520c1e4713340f5bbf66de215471c2863e8fbdf2/src/lightning/pytorch/trainer/connectors/signal_connector.py#L86).

However, when running in an enroot container with pyxis, the scontrol command is not available. Any thoughts on how this could be resolved? This is similar to, but not the same as, #31; here we'd just like to call scontrol requeue $SLURM_JOB_ID.

flx42 commented 7 months ago

I've seen multiple approaches for this. It's usually handled at the sbatch level, but a requeue is probably not the right approach: a requeue can cause issues with the job log, as the default is often to truncate the file, and Slurm will also delay the start time of the requeued job (not sure why exactly).

Anyway, here is what is commonly used:
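(The snippet originally posted is not reproduced here. The following is a minimal sketch of the kind of sbatch-level pattern being described, i.e. submitting a dependent follow-up job instead of requeuing; the script name, $IMAGE placeholder, and options are illustrative, not taken from the thread.)

```bash
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --time=04:00:00

# Instead of requeuing (which can truncate the job log and delays the restart),
# submit a follow-up job that starts only after this one finishes.
# "train.sbatch" is assumed to be the path to this very script.
sbatch --dependency=afterany:"$SLURM_JOB_ID" train.sbatch

# Launch the actual work; on the follow-up run the application is expected to
# resume from its latest checkpoint. $IMAGE is an illustrative placeholder.
srun --container-image="$IMAGE" python train.py
```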

flx42 commented 7 months ago

> Slurm will also delay the start time of the requeued job (not sure why exactly).

I found that it's done here: https://github.com/SchedMD/slurm/blob/slurm-23-11-0-1/src/slurmctld/job_mgr.c#L16376-L16384. A requeued job will be delayed by 121 seconds, even when it might have started immediately.

itzsimpl commented 7 months ago

Thanks for the info. I'm aware that there is a delay; although a nuisance, I don't find it a big problem at the moment.

The downside of the approach of queuing a follow-up job before launching the application is that it requires executing multiple commands, or automating a single sbatch script that does so, which can become tricky. In addition, it does not really solve the main problem: how to notify a Lightning app that is running in an enroot container to create a checkpoint if the signal happens to arrive between two regular checkpoints. I'm aware that one can increase the frequency of checkpointing; what I'm really asking is whether there is a way to make Lightning, when run in an enroot/pyxis container, behave as if it were run on bare metal.

Giving the step a time limit slightly lower than the allocation's will ensure that the sbatch script's signal handler has some time left. But the app running in the srun step will already be dead, so there is no chance of it performing a checkpoint. Note also that when Slurm sends an abort signal it will wait for some time on its own: https://github.com/SchedMD/slurm/blob/769da35200d4a2c0f42a6e060b2b180ed95bfc8e/src/api/step_launch.c#L671.

We currently do the following (see the sketch below). The main sbatch script handles the signal (B:USR1@90), and the srun command is run in the background so that the sbatch script gets notified 90s before the job time is exhausted. This requeues the job as desired, but has the downside that the Lightning-based scripts, which run in the enroot container (started via srun), will not create a checkpoint before being requeued. Even if the sbatch script, on receipt of the signal, sends yet another signal (USR2) to the srun PID (i.e. to the Lightning-based scripts running in enroot/pyxis), the script somehow does not seem to create a checkpoint, which it should do before the requeue is called. The signal is properly configured (https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html#enable-auto-wall-time-resubmissions) and the Lightning code that handles it is here: https://github.com/Lightning-AI/pytorch-lightning/blob/520c1e4713340f5bbf66de215471c2863e8fbdf2/src/lightning/pytorch/trainer/connectors/signal_connector.py#L67-L103.
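For reference, a minimal sketch of the setup described above (the actual script isn't shown in the thread; $IMAGE, the script name, and the handler body are illustrative):

```bash
#!/bin/bash
#SBATCH --signal=B:USR1@90   # deliver USR1 to the batch shell 90s before the time limit
#SBATCH --requeue

# On USR1: ask the containerized step to checkpoint (Lightning is assumed to be
# configured to handle USR2), wait for it to exit, then requeue from outside
# the container, where scontrol is available.
handler() {
    echo "USR1 received: forwarding USR2 to the step"
    kill -USR2 "$SRUN_PID"
    wait "$SRUN_PID"
    scontrol requeue "$SLURM_JOB_ID"
}
trap handler USR1

# Run the step in the background so the batch shell can receive the signal
# while the step is still running.
srun --container-image="$IMAGE" python train.py &
SRUN_PID=$!
wait "$SRUN_PID"
```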

flx42 commented 7 months ago

You should consider the checkpointing and the follow-up job separately. They are both orthogonal to containers.

The approach above also works for regular PyTorch. It looks like there is a problem with signal handling in PyTorch Lightning, and that the requeue does not work inside containers, but you don't need to rely on either of those.