Moved to libnvidia-container, since this looks like an issue with the new firmware path.
@mjg0 you can try adding `strace` in front of the CLI in the `98-nvidia.sh` hook to see where it is coming from.
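Schematically, that means changing the last line of the hook as below; the real argument list on that line will differ, and `"$@"` here is just a stand-in for whatever the hook already passes:

```sh
# Last line of 98-nvidia.sh, with strace prepended; everything after
# `nvidia-container-cli` stays exactly as the hook already has it:
exec strace -f -o /tmp/enroot-strace.log nvidia-container-cli "$@"
```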
Hi @mjg0 / @3XX0. In order to support devices using GSP firmware (e.g. A100 80GB devices), we mount the firmware from the host into the container.
As @3XX0 mentions, the `strace` output would be useful, as well as some more information on the system you are seeing this on.
Our cluster has two types of GPU nodes: one type with 2 K80s and 2 Haswell Xeon E5-2670s, and another type with 4 P100s and 2 Broadwell Xeon E5-2680s. The OS (including `/usr/lib/firmware`) for all nodes is NFS-mounted, and the extracted container is in `/tmp`, which is on node-local disk. The failure is identical on both types of nodes, and even on non-GPU nodes with different CPUs.
Both `enroot` and `libnvidia-container` were installed from source using GCC 11.2.0:
```sh
# enroot build
git clone --recurse-submodules https://github.com/NVIDIA/enroot.git
cd enroot
git checkout tags/v3.4.0
DESTDIR=/apps/enroot/3.4.0/gcc-11.2.0 make -j install prefix=
# I had to comment out the `/etc/hostname` line in the fstab config file since our OS has no /etc/hostname

# libnvidia-container build
# download release from GitHub, extract, then:
DESTDIR=/apps/libnvidia-container/1.5.1/gcc-11.2.0 make -j install prefix=
# the RUNPATH was `$ORIGIN/../$LIB`, I had to change it so it could find libnvidia-container.so.1
patchelf --set-rpath '$ORIGIN/../lib' nvidia-container-cli
```
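To double-check the `patchelf` change, the `RUNPATH` entry can be inspected with `readelf` (an illustrative verification step, not part of the original instructions):

```sh
# Print the dynamic section and look for the RUNPATH/RPATH entry:
readelf -d nvidia-container-cli | grep -E 'R(UN)?PATH'
```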
Here is the log resulting from adding `strace -o enroot-strace.log` after the `exec` on the last line of `98-nvidia.sh`: enroot-strace.log. The failing call is:
```
mount(NULL, "/tmp/enroot/pccl+containertest+0.1/usr/lib/firmware/nvidia/470.57.02", NULL, MS_NOSUID|MS_NOEXEC|MS_REMOUNT|MS_BIND, NULL) = -1 EPERM (Operation not permitted)
```
Looks like a bug in the mount code: it tries to unconditionally set `nosuid,noexec`, when instead the mount needs to account for the filesystem flags. You can probably downgrade libnvidia-container (to a version from before the inclusion of the firmware directory) as a workaround for now.
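For context, the flags already present on the host mount can be listed with `findmnt` (an illustrative command; in an unprivileged mount namespace, a bind remount generally cannot drop flags that are locked on the source mount):

```sh
# Show how the host firmware path is mounted; flags such as nodev or nosuid
# on an NFS-mounted OS must be carried over by the remount in the container:
findmnt -T /usr/lib/firmware -o TARGET,SOURCE,FSTYPE,OPTIONS
```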
I downgraded `libnvidia-container` to 1.4.0 and it worked. Thank you for the insight! @elezar, if you know what code needs to be changed I can test a patch, or try to fix it and submit a pull request if you point me in the right direction.
@mjg0 thanks for confirming that downgrading works. I will share some links to code locations if you're still up for getting something working on your end.
For the time being, could you provide more information on the properties of the `/lib/firmware/nvidia/470.57.02` folder on your system? (I don't have ready access to a system that uses GSP firmware.)
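For example, the output of something like the following (illustrative commands) would capture them:

```sh
ls -la /lib/firmware/nvidia/470.57.02/
stat -c '%A %U:%G %s %n' /lib/firmware/nvidia/470.57.02/gsp.bin
file /lib/firmware/nvidia/470.57.02/gsp.bin
```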
The only file in that directory is `gsp.bin`. It's an ELF, but without execute permissions.
I'm certainly willing to look around a bit. Where do you think I should start?
Thanks @mjg0. The mount that is failing would be the one here, which is called for the firmware directory here. The firmware directory is currently the only element of `info->dirs` at the call site.
A "quick and dirty" approach to get this fixed on your end would be to create another mount_firmware_directory
function that has the correct mount properties and call this instead, something that queries the filesystem flags and sets these could then be added as a follow-up.
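In shell terms, the intended behaviour looks roughly like this; a sketch of the semantics only, not the actual C code in libnvidia-container (paths taken from the strace output above):

```sh
src=/usr/lib/firmware/nvidia/470.57.02
dst=/tmp/enroot/pccl+containertest+0.1/usr/lib/firmware/nvidia/470.57.02

# Query the FS-independent flags the source is already mounted with,
# e.g. "rw,nosuid,nodev,relatime" on an NFS root:
opts=$(findmnt -n -T "$src" -o VFS-OPTIONS)

# Bind mount, then re-apply those flags during the remount
# instead of hard-coding only nosuid,noexec:
mount --bind "$src" "$dst"
mount -o "remount,bind,nosuid,noexec,$opts" "$dst"
```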
@mjg0 we have a merge request out where we are testing a fix for this. If you get time to test things with these changes applied on your end, that would be useful.
`libnvidia-container-1.8.0-rc.2` is now live with a fix for this behaviour.
Please see #111 (comment) for instructions on how to get access to this RC (or wait for the full release).
`libnvidia-container-1.8.0` with a fix for this is now GA.
Release notes here: https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.8.0
When I run `enroot` with a container that uses GPUs on RHEL 7.9, a failure to mount some firmware derails `enroot`. The container in question is `docker://pccl/containertest:0.1`, but I wouldn't try downloading it unless you have a lot of storage.
I'm not sure if this is an issue with `enroot` or with `nvidia-container-cli`, but I figured I'd post here first to get some context.
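For reference, the reproduction is roughly as follows (the exact `enroot` invocations are my assumption):

```sh
enroot import docker://pccl/containertest:0.1   # creates pccl+containertest+0.1.sqsh
enroot create pccl+containertest+0.1.sqsh       # unpacks the image
enroot start pccl+containertest+0.1             # runs the 98-nvidia.sh hook and fails
```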