NERSC / podman-hpc


occasional hang when creating temporary dir/files #98

Open lastephey opened 7 months ago

lastephey commented 7 months ago

Jan reports:


Hi, I'm scaling up a Slurm job with podman, using the command:

srun -n 64 podman-hpc run -it \
       --volume $outPath:/wrk \
       --workdir /wrk \
       $IMG myCode.py

and from time to time I see an error in the Slurm output:

time="2023-11-18T06:58:17-08:00" level=error msg="Failed to create temp directory for user: mkdir /tmp/containers-user-31480: file exists"

It hangs the job for 5-10 seconds and then proceeds. Is this a serious issue, or should I just move on?


I think we could handle this more gracefully with a try/except: https://github.com/NERSC/podman-hpc/blob/main/podman_hpc/siteconfig.py#L317
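
A minimal sketch of the try/except idea, assuming the failure is a race where several ranks on the same node try to create the same directory at once. The helper name `ensure_dir` and its `mode` default are hypothetical and are not taken from siteconfig.py:

```python
import os


def ensure_dir(path, mode=0o700):
    """Create `path`, tolerating another process creating it first.

    Hypothetical helper for illustration; not the code in siteconfig.py.
    """
    try:
        os.mkdir(path, mode)
    except FileExistsError:
        # Another rank won the race. That is fine as long as what
        # exists is actually a directory; otherwise re-raise.
        if not os.path.isdir(path):
            raise
    return path
```

Calling it a second time on the same path is a no-op, so every rank can call it unconditionally.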

lastephey commented 7 months ago

pathlib seems to be a popular suggestion: https://stackoverflow.com/questions/273192/how-do-i-create-a-directory-and-any-missing-parent-directories
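
For reference, a rough sketch of the pathlib variant, again with a hypothetical helper name (`ensure_dir_pathlib`). `Path.mkdir(exist_ok=True)` ignores an already-existing directory but still raises `FileExistsError` if the path exists and is not a directory:

```python
from pathlib import Path


def ensure_dir_pathlib(path, mode=0o700):
    """Create `path` and any missing parents; do nothing if it exists.

    Hypothetical helper for illustration; not the code in siteconfig.py.
    """
    p = Path(path)
    # `mode` applies only to the final directory; missing parents are
    # created with default permissions.
    p.mkdir(mode=mode, parents=True, exist_ok=True)
    return p
```

This avoids the explicit try/except, at the cost of also creating missing parent directories, which may or may not be wanted here.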