NVIDIA / libnvidia-container

NVIDIA container runtime library
Apache License 2.0
error using nvidia-container-cli for enroot (bug) #125

Open KapilS25 opened 3 years ago

KapilS25 commented 3 years ago

As reported by the enroot developer, please look into this: https://github.com/NVIDIA/enroot/issues/54#issuecomment-762169057

klueska commented 3 years ago

Thanks for the report. There is a long thread at that link (some of which is relevant, some of which is not). Can you summarize here the exact bug you are seeing with libnvidia-container?

KapilS25 commented 3 years ago

With cgroups, nvidia-container-cli is unable to mount /dev from the host onto the container's /dev.
I need to use the --no-devbind flag with nvidia-container-cli, which should not be necessary, as mentioned by the enroot developer: https://github.com/NVIDIA/enroot/issues/54#issuecomment-762148027

klueska commented 3 years ago

I don't know anything about enroot. Do you have a simple reproducer with nvidia-container-cli directly that I can use to see what your issue is?

KapilS25 commented 3 years ago

Adding @3XX0 (enroot developer) to the conversation. @3XX0, can you please explain the issue to @klueska? I don't know exactly how enroot start uses nvidia-container-cli.

3XX0 commented 3 years ago

Basically, it looks like the device mount fails if the device already exists at the destination. I've never seen this before, so this might be RHEL-specific:

mount error: file creation failed: /scratch/pbs/enroot-data/user-613.chas052/lammps/dev/nvidia-uvm-tools: operation not permitted

/dev/nvidia-uvm-tools already exists because /dev is bind mounted in the container, so mount shouldn't try to create it.

@KapilS25 Can you try adding strace to nvidia-container-cli in the nvidia hook so we can see the exact failure on open?

KapilS25 commented 3 years ago

Please find attached the strace output for nvidia-container-cli: dev_mount_issue_nvidia-container-cli.strace.txt

3XX0 commented 3 years ago

Thanks, this makes sense now: the umask will make the open fail as it tries to adjust permissions.
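
For context on the umask point, here is a minimal Python sketch of how the process umask masks the mode requested when a file is created. It only uses regular files in a temporary directory, so it does not reproduce the device-node failure seen in the strace, and the helper name `mode_after_create` is hypothetical.

```python
import os
import stat
import tempfile


def mode_after_create(requested: int, mask: int) -> int:
    """Create a file with `requested` mode under umask `mask`.

    Returns the permission bits the file actually ends up with.
    """
    old_mask = os.umask(mask)  # os.umask returns the previous mask
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "node")
            # The kernel applies (requested & ~mask) to the new file.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, requested)
            os.close(fd)
            return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old_mask)  # always restore the caller's umask


if __name__ == "__main__":
    print(oct(mode_after_create(0o666, 0o077)))
    print(oct(mode_after_create(0o666, 0o022)))
```

With a umask of 0o077, a request for 0o666 yields 0o600, so any code that later tries to bring the file to the requested mode needs permission to change it.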

klueska commented 3 years ago

So it sounds like this is not actually a bug in libnvidia-container then, but rather expected behaviour given the umask set on /dev/nvidia-uvm-tools.

3XX0 commented 3 years ago

It is a bug: the file exists and can just be mounted over. libnvidia-container shouldn't try to adjust the permissions of a device file to reflect the system umask. The permissions of the underlying file don't really matter.
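
The behaviour argued for here can be sketched as follows: reuse an existing mount target as-is and create a placeholder only when nothing is there. This is a Python illustration under stated assumptions, not libnvidia-container's code (which is written in C). `ensure_mount_target` is a hypothetical helper, and the bind mount itself is left out because it requires privileges.

```python
import os


def ensure_mount_target(path: str) -> bool:
    """Make sure `path` exists so something can be bind mounted over it.

    Returns True if a placeholder file was created, False if the path
    already existed and was left untouched (no chmod, no recreation).
    """
    if os.path.lexists(path):
        # Assumed policy: an existing node is a valid mount target,
        # whatever its current permissions are.
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    # O_EXCL guards against a race where the file appears in between.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.close(fd)
    return True
```

The design choice is that the existence check comes first and nothing else is done to an existing file, so its mode never has to be changed.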

ydm-amazon commented 7 months ago

I've also encountered the same bug.