NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Pyxis randomly hangs on some imports #106

Closed (slurmuser closed 1 year ago)

slurmuser commented 1 year ago

Hi, we have a weird issue that we can't debug: the image import appears to just hang. With enroot directly the import works fine, but with pyxis it often hangs at:

pyxis: importing docker image ...

This occurs roughly 25% of the time. I don't really know which logs to provide. It seems to be an issue with pyxis, since the import does sometimes work, so the container image is clearly healthy. The container also works fine with podman.

flx42 commented 1 year ago

You should check the following:

  1. Is there an enroot process still executing on the machine? (check with ps or top/htop). If so, then the import is probably merely slow, not stuck.
  2. Try using a single task and a simple command to execute, like this: srun --ntasks=1 --container-image=... hostname. Sometimes it's the command inside the container that hangs, but the log from pyxis can make you believe that it's still importing the image. In recent pyxis versions, the message is printed after the import is done.
  3. Check the slurmd log for more pyxis logs such as when the container is created and started. You can increase the slurmd verbosity before doing so.
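
The three checks above can be sketched as a small Python helper. This is only a rough sketch, not part of pyxis or enroot: the function names, the timeout value, and the image argument are all hypothetical, and it assumes that pgrep and scontrol are available and that `scontrol show config` prints a `SlurmdLogFile = ...` line on your cluster. The process check only makes sense when run on the compute node where the import is happening.

```python
import subprocess


def build_probe_command(image):
    """Check 2: a single-task srun that only runs `hostname` in the container.

    `image` is a placeholder for whatever you pass to --container-image.
    """
    return ["srun", "--ntasks=1", f"--container-image={image}", "hostname"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'timeout'."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


def list_enroot_processes():
    """Check 1: command lines of local processes that mention 'enroot'.

    Run this on the compute node; an empty list suggests no import is
    in progress there. Returns [] if pgrep is not installed.
    """
    try:
        result = subprocess.run(
            ["pgrep", "-a", "-f", "enroot"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


def parse_slurmd_log_path(scontrol_output):
    """Check 3: extract SlurmdLogFile from `scontrol show config` output.

    Assumes 'Key = value' lines; returns None if the key is absent.
    """
    for line in scontrol_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "SlurmdLogFile":
            return value.strip()
    return None
```

For example, `run_with_timeout(build_probe_command(image), 600)` returning "timeout" while `list_enroot_processes()` on the compute node still shows activity would point to a slow import rather than a stuck one. The 600-second limit is an arbitrary choice here; pick one that fits how long your image normally takes to import.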