NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.
Apache License 2.0
649 stars 94 forks source link

Critical issue; invalid parameter potential leading to loss of files #215

Closed itzsimpl closed 1 month ago

itzsimpl commented 1 month ago

On our Slurm cluster we use pyxis and are very happy with it. We have it setup without automatic HOME mount (if this has any influence on the current issue at all).

Recently we had a user who initiated the command

srun --container-image=<image url> --container-save="./" bash -c '<command>'

from his home folder and managed to delete all of his existing files.

A quick review of pyxis code indicates that this may be because, based on the documentation, it is simply assumed that the user will supply a file path, as instructed, not a folder path. Could something in the likes of

    if (optarg[strlen(optarg)-1] == '/') {
        slurm_error("pyxis: --container-save: argument must be file, not path!")
        return (-1)
    }

be added to https://github.com/NVIDIA/pyxis/blob/424563865a491f2d8f3c396a095d18db81f1fb07/args.c#L385

flx42 commented 1 month ago

@3XX0 for review, I think maybe we can fix that in enroot, by changing this line: https://github.com/NVIDIA/enroot/blob/b73986ae50b075601be393108d97dae1f50943a1/src/runtime.sh#L599

flx42 commented 1 month ago

@itzsimpl if you manually apply the patch 2bd51434bbc427a1d8463e5682c89bac4b8fda51 then this issue should not be possible anymore. Sorry about that!

itzsimpl commented 3 weeks ago

@flx42 thank you for the patch, I can confirm that the loss of data does not occur anymore. The downside of the resolution being pushed to Enroot is that now all the "work" done inside the container is lost. In other words, if Pyxis handles the case, the user will get notified before the job even starts; if Enroot handles the case, the user will update the container, but on exit it will fail on save. From the UX perspective this may be worth to consider.

A simple test.

$ srun -p dev --container-image=ubuntu:20.04 --container-save=./ ls
srun: job 8203 queued and waiting for resources
srun: job 8203 has been allocated resources
pyxis: importing docker image: ubuntu:20.04
pyxis: imported docker image: ubuntu:20.04
bin
boot
dev
etc
home
lib
lib32
lib64
libx32
local
media
mnt
opt
proc
root
run
sbin
shared
srv
sys
tmp
usr
var
slurmstepd: error: pyxis: child 220192 failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     rm: cannot remove '/shared/home/ilb/test-remove': Is a directory
slurmstepd: error: pyxis: failed to export container pyxis_8203_8203.0 to ./
slurmstepd: error: spank: required plugin spank_pyxis.so: task_exit() failed with rc=-1
slurmstepd: error: Unable to spank task 0 at exit
flx42 commented 3 weeks ago

Yes I can also add a check to pyxis, but it will not catch all cases, like this: --container-save=/home/flx42

itzsimpl commented 3 weeks ago

You're right, trying to handle all possible cases might be an overkill. Thank you for the quick fix.