Open lastephey opened 7 months ago
Jan reports:
Hi, I'm scaling up a Slurm job with podman, using the command:
srun -n 64 podman-hpc run -it \
    --volume $outPath:/wrk \
    --workdir /wrk \
    $IMG myCode.py
and I see from time to time an error in Slurm output:
time="2023-11-18T06:58:17-08:00" level=error msg="Failed to create temp directory for user: mkdir /tmp/containers-user-31480: file exists"
The job hangs for 5-10 seconds and then proceeds. Is this a serious issue, or should I just move on?
I think we could handle this more gracefully with a try/except: https://github.com/NERSC/podman-hpc/blob/main/podman_hpc/siteconfig.py#L317
pathlib seems to be a popular suggestion: https://stackoverflow.com/questions/273192/how-do-i-create-a-directory-and-any-missing-parent-directories
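A minimal sketch of what that could look like, assuming the directory is created from Python and that the failure is a race between tasks on the same node all trying to create the same path. The helper name `ensure_dir` is hypothetical and is not taken from `siteconfig.py`:

```python
from pathlib import Path


def ensure_dir(path):
    """Create `path` (and any missing parents), tolerating concurrent creation.

    Hypothetical helper for illustration; not the actual podman-hpc code.
    """
    target = Path(path)
    try:
        # exist_ok=True makes mkdir a no-op when the directory already
        # exists, e.g. because another task created it first.
        target.mkdir(parents=True, exist_ok=True)
    except FileExistsError:
        # Still raised when the path exists but is not a directory.
        # Only swallow the error if a directory is actually in place.
        if not target.is_dir():
            raise
    return target
```

With this, 64 tasks calling `ensure_dir("/tmp/containers-user-31480")` at the same time should all succeed without logging an error. Note that the message above is in podman's own log format, so part of the fix may need to happen before podman is invoked rather than in this helper.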