Closed: jfolz closed this issue 2 years ago
Thanks for the bug report. I can reproduce the issue by injecting an artificial delay before saving the container image. So it looks like the slurmstepd for step N is indeed executing concurrently with job N+1. This might require a slightly different approach in the code.
Hopefully this is fixed in 0.13. Let me know if you get a chance to deploy the new version and can confirm the fix.
That was fast 😄 will do.
Can confirm it now works as intended 👍 thanks again for the quick turnaround 😄
Some of our users alerted us that they could not launch their images saved with `--container-save`. We found this was because they were too fast. `srun` had exited, suggesting to them that the saved image was now usable, yet the export process was still running on the compute node. This seems odd, since the code suggests that it should wait until the export finishes.

We're running Slurm 20.02.7, Pyxis 0.12.0, and Enroot 3.1.0. I can't say for certain if it was working with earlier versions of Pyxis, or if by coincidence the export always finished before attempting to use the image, but we were only recently made aware of this.
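
For anyone stuck on an affected version, one possible stopgap is to wait until the saved image file stops growing before launching from it. Below is a minimal sketch in Python, assuming the image is written to a path that is visible from wherever the check runs (for example a shared filesystem). The helper name `wait_until_stable` and its parameters are hypothetical and not part of Pyxis, Enroot, or Slurm, and an unchanged file size is only a heuristic for "export finished", not a guarantee.

```python
import os
import time


def wait_until_stable(path, interval=2.0, stable_checks=3, timeout=600.0):
    """Poll `path` until its size is non-zero and unchanged for
    `stable_checks` consecutive polls. Returns True when that happens,
    False if `timeout` seconds pass first.

    Heuristic only: a stalled export would also look "stable".
    """
    deadline = time.monotonic() + timeout
    last_size = None
    unchanged = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            # File does not exist yet (or is not readable); keep waiting.
            size = None
        if size is not None and size > 0 and size == last_size:
            unchanged += 1
            if unchanged >= stable_checks:
                return True
        else:
            # Size changed, file is empty, or file is missing: start over.
            unchanged = 0
        last_size = size
        time.sleep(interval)
    return False
```

Users could call this on the path they passed to `--container-save` after `srun` returns and before submitting the next job that uses the image. It only papers over the race; the proper fix is for the job step not to complete until the export has finished.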