NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.
Apache License 2.0

enroot-switchroot: failed to execute: /bin/sh: Permission denied #156

Open szhengac opened 1 year ago

szhengac commented 1 year ago

Hi,

I am testing Megatron training on a demo machine with Slurm. The machine has very limited disk space, so I had to move the squashfs filesystem path to /run (a tmpfs filesystem) by using srun --export="XDG_DATA_HOME=/run/tmp". I am not sure if this is the right way to move the default path out of the home directory, but the "No space left on device" error went away after setting this environment variable. However, a different error now appears, as shown below. Any advice would be very much appreciated.

[2023-05-04T22:50:32.199] launch task StepId=81.0 request from UID:1002 GID:1002 HOST:127.0.0.1 PORT:60098
[2023-05-04T22:50:32.200] task/affinity: lllp_distribution: JobId=81 implicit auto binding: cores, dist 8192
[2023-05-04T22:50:32.200] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2023-05-04T22:50:32.200] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [81]: mask_cpu, 0x00000000000000000000000000010000000000000000000000000001
[2023-05-04T22:50:32.273] [81.0] pyxis: creating container filesystem: pyxis_81.0
[2023-05-04T22:50:44.775] [81.0] pyxis: starting container: pyxis_81.0
[2023-05-04T22:50:44.984] [81.0] error: pyxis: container start failed with error code: 1
[2023-05-04T22:50:44.984] [81.0] error: pyxis: printing enroot log file:
[2023-05-04T22:50:44.984] [81.0] error: pyxis:     enroot-switchroot: failed to execute: /bin/sh: Permission denied
[2023-05-04T22:50:44.984] [81.0] error: pyxis: couldn't start container
[2023-05-04T22:50:44.984] [81.0] error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
[2023-05-04T22:50:44.984] [81.0] error: Failed to invoke spank plugin stack
[2023-05-04T22:50:44.986] [81.0] pyxis: removing container filesystem: pyxis_81.0
[2023-05-04T22:50:46.968] [81.0] done with job

The permissions on /bin/sh look correct:

lrwxrwxrwx 1 root root 4 Mar 23  2022 /bin/sh -> dash

flx42 commented 1 year ago

Where are the container filesystems stored, i.e. what is the value of ENROOT_DATA_PATH, and how is that location mounted?

This error can happen when ENROOT_DATA_PATH is on a filesystem mounted with noexec.
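
One way to check for this condition is to read the mount options of the filesystem that holds a given path. The sketch below is only an illustration: the function names `has_noexec` and `check_exec_mount` are made up for this example, and it assumes `findmnt` from util-linux is installed.

```shell
#!/bin/sh
# has_noexec OPTIONS: succeed if the comma-separated mount options
# contain the noexec flag (hypothetical helper name).
has_noexec() {
    case ",$1," in
        *,noexec,*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_exec_mount PATH: report whether PATH sits on a noexec mount.
# Assumes findmnt (util-linux) is available; returns 2 if the lookup fails.
check_exec_mount() {
    opts=$(findmnt -n -o OPTIONS --target "$1") || return 2
    if has_noexec "$opts"; then
        echo "$1: mounted noexec ($opts)"
    else
        echo "$1: exec allowed ($opts)"
    fi
}

# Example call for the directory discussed in this thread:
# check_exec_mount /run
```

If the directory that ends up holding the container root filesystem reports noexec, binaries inside the container such as /bin/sh cannot be executed, which matches the "Permission denied" message above.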

szhengac commented 1 year ago

@flx42 I use enroot import to convert the Docker image to sqsh format in advance and put it under /run (a tmpfs filesystem); otherwise it would use ~/.cache and /tmp, and the root disk does not have much space available. ENROOT_DATA_PATH is empty on my system. By default, from what I can see, everything is put under ~/.local.

3XX0 commented 1 year ago

--export="XDG_DATA_HOME=/run/tmp" is going to influence ENROOT_DATA_PATH, and /run is noexec on most distributions. I suggest you configure ENROOT_DATA_PATH in /etc/enroot/enroot.conf to point to some other location. Also, be careful with --export, as it will unset the rest of the environment.
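
As a rough illustration of that suggestion, an entry in /etc/enroot/enroot.conf might look like the fragment below. The directory is a hypothetical placeholder, not something named in this thread, and the whitespace-separated key/value layout is assumed to match the commented defaults shipped in that file, so check it against your installed copy.

```
# /etc/enroot/enroot.conf (fragment)
# Hypothetical location: choose a filesystem that is not mounted noexec
# and has enough room for unpacked container images.
ENROOT_DATA_PATH /scratch/enroot/data
```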

szhengac commented 1 year ago

I have managed to free up a disk for this, but thanks for the advice. BTW, I think we can keep the environment by adding ALL to --export?
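
For reference, Slurm's --export option accepts ALL followed by extra NAME=value entries, which propagates the submitting shell's environment and adds the listed variables. A minimal sketch, assuming a hypothetical data directory and printing the command instead of running it:

```shell
#!/bin/sh
# Hypothetical location on a filesystem mounted with exec permission.
data_home=/scratch/enroot-data

# ALL keeps the submitting shell's environment; the NAME=value entry is added to it.
export_arg="ALL,XDG_DATA_HOME=${data_home}"

# Print the command rather than run it, since the job itself is site-specific.
echo "srun --export=${export_arg} <your srun options and command>"
```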