NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

srun exits while container export is still running #76

Closed jfolz closed 2 years ago

jfolz commented 2 years ago

Some of our users alerted us that they could not launch images they had saved with --container-save. We found this was because they tried to use the image too soon: srun had exited, suggesting to them that the saved image was now usable, yet the export process was still running on the compute node. This seems odd, since the code suggests that it should wait until the export finishes.
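As a stopgap on affected versions, a user-side script could poll the saved image until it stops changing before launching it. The following is only a minimal sketch, not part of Pyxis or Enroot: the helper name `wait_until_stable` and its parameters are hypothetical, and it assumes the export writes directly to the final path, so the file size keeps changing while the export runs.

```python
import os
import time


def wait_until_stable(path, interval=5.0, checks=3, timeout=600.0):
    """Return True once `path` exists and its size has stayed the same
    for `checks` consecutive polls; return False if `timeout` expires.

    Hypothetical helper: assumes the export writes in place, so an
    unchanged size is only a heuristic for "export finished".
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file does not exist yet (or is not accessible)
        if size >= 0 and size == last_size:
            stable += 1
            if stable >= checks:
                return True
        else:
            stable = 0  # size changed or file missing: start counting again
        last_size = size
        time.sleep(interval)
    return False
```

A job script could call `wait_until_stable("/path/to/image.sqsh")` (path is a placeholder) and only start the next srun if it returns True. An unchanged size does not prove the export is complete, so this is a heuristic at best.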

We're running Slurm 20.02.7, Pyxis 0.12.0, and Enroot 3.1.0. I can't say for certain whether it worked with earlier versions of Pyxis, or whether by coincidence the export always finished before anyone attempted to use the image, but we were only recently made aware of this.

flx42 commented 2 years ago

Thanks for the bug report. I can reproduce the issue by injecting an artificial delay before saving the container image. So it looks like the slurmstepd for step N is indeed executing concurrently with job N+1. This might require a slightly different approach in the code.

flx42 commented 2 years ago

Hopefully this is fixed in 0.13. Let me know if you get a chance to deploy the new version to confirm it is fixed.

jfolz commented 2 years ago

That was fast 😄 will do.

jfolz commented 2 years ago

Can confirm it now works as intended 👍 thanks again for the quick turnaround 😄