NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.
Apache License 2.0

How to set up the temporary files properly with multiple users through pyxis + enroot? #86

Closed crinavar closed 3 years ago

crinavar commented 3 years ago

Hi Community, I run into an issue every time our DGX node reboots: the /tmp/enroot.debug file gets created with permissions only for the user who launched the first SLURM job through pyxis. Let's say the DGX system reboots and the user "john" is the first one to run a SLURM job; the temporary file /tmp/enroot.debug is then created with "john" as owner and write permission only for him. When a second user, "mary", launches a job, pyxis reports a permission denied error for the file /tmp/enroot.debug. My current workaround is to manually change the permissions of /tmp/enroot.debug to 1777 and change its owner to root.

What is the best way to set up pyxis+enroot so that it works properly after a reboot? Many thanks in advance.
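
The ownership problem described above can be confirmed by looking at the owner and mode of the file before changing anything. This is a minimal sketch, assuming GNU coreutils `stat` is available on the node; the helper name `show_owner_mode` is hypothetical and is not part of enroot or pyxis.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of enroot or pyxis).
# Prints "<owner> <octal mode> <path>" for a file, using GNU coreutils stat.
# Defaults to the file named in this thread when no argument is given.
show_owner_mode() {
    local path="${1:-/tmp/enroot.debug}"
    stat -c '%U %a %n' -- "$path"
}

# Example call on the affected node:
#   show_owner_mode /tmp/enroot.debug
```

If the output shows the first job's user as owner with a mode such as 644, only that user can write to the file, which would match the permission denied error seen by the second user.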

flx42 commented 3 years ago

Hi,

I'm not sure how you end up with /tmp/enroot.debug; pyxis and enroot do not write to this file by default. Can you share the exact error you're seeing?

Thanks

crinavar commented 3 years ago

Hi flx42, thanks for the fast response, and sorry for not providing error messages. By the time I wrote the post I had already applied the manual fix, and users are running jobs right now. Let me find a time window this week with no jobs running, and I will reboot the system to get the errors again.

crinavar commented 3 years ago

Hi flx42, sorry for the delay. I still have not found a moment to reproduce the problem, but here is the error I managed to recover from the last time a user hit it:

slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing contents of log file ... 
slurmstepd: error: pyxis:     /etc/enroot/hooks.d/98-nvidia.sh: line 44: /tmp/enroot.debug: Permission denied
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: pyxis: if the image has an unusual entrypoint, try using --no-container-entrypoint
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: nodeGPU01: task 0: Exited with exit code 1

Indeed, maybe this is a configuration that the deployment team could have set up on either the pyxis side or the enroot side. EDIT: I will try to check the 98-nvidia.sh script as soon as possible.
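
Since the log points at line 44 of /etc/enroot/hooks.d/98-nvidia.sh, one way to start that check is to list every line of the hook that mentions the debug file. This is a rough sketch; the helper name `find_path_refs` is hypothetical, and the default paths are simply the ones that appear in the error output.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of enroot or pyxis).
# Lists the numbered lines of a hook script that contain a given literal path.
# Exit status is 1 when no line matches, as with plain grep.
find_path_refs() {
    local hook="${1:-/etc/enroot/hooks.d/98-nvidia.sh}"
    local target="${2:-/tmp/enroot.debug}"
    # -n prints line numbers, -F treats the target as a fixed string
    grep -nF -- "$target" "$hook"
}

# Example call on the affected node:
#   find_path_refs /etc/enroot/hooks.d/98-nvidia.sh /tmp/enroot.debug
```

Any match is a redirection or command that was presumably added locally, given that pyxis and enroot are reported in this thread not to write to that file by default.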

flx42 commented 3 years ago

Yeah, looks like the 98-nvidia.sh file was modified manually. Compare it with https://github.com/NVIDIA/enroot/blob/v3.3.0/conf/hooks/98-nvidia.sh
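
The comparison suggested above could be done with a plain unified diff. This is only a sketch: the helper name `compare_hook` is hypothetical, and the raw download URL in the comment is an assumption derived from the blob link above using GitHub's usual raw-file address scheme. The installed enroot version may also differ from v3.3.0, in which case the matching tag should be used instead.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of enroot or pyxis).
# Prints a unified diff between a reference copy of the hook and the
# installed one. Exit status follows diff: 0 if identical, 1 if they differ.
compare_hook() {
    local reference="$1"
    local installed="${2:-/etc/enroot/hooks.d/98-nvidia.sh}"
    diff -u -- "$reference" "$installed"
}

# Example (assumes curl is installed; the raw URL is an assumption, see above):
#   curl -fsSL -o /tmp/98-nvidia.sh.upstream \
#     https://raw.githubusercontent.com/NVIDIA/enroot/v3.3.0/conf/hooks/98-nvidia.sh
#   compare_hook /tmp/98-nvidia.sh.upstream
```

Lines prefixed with `+` in the output exist only in the installed hook, so a local addition that writes to /tmp/enroot.debug would show up there.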

3XX0 commented 3 years ago

Closing as it appears that the hook has been modified. Feel free to reopen if it is indeed a bug in enroot.