Hi! Libnvidia-container currently relies on glibc internals when locating the host system's libraries, which limits its compatibility with the wider range of Linux distributions. Nvidia-container-toolkit appears to provide limited support for static configuration, e.g. the `ModeCSV` used for Jetsons: https://github.com/NVIDIA/nvidia-container-toolkit/blob/a2262d00cc6d98ac2e95ae2f439e699a7d64dc17/pkg/nvcdi/lib.go#L98-L102, but many tools (e.g. Apptainer and SingularityCE) rely on libnvidia-container directly. I think it's desirable that libnvidia-container (also) support static configuration, whereby the user would specify, at build time or at runtime, a list of search paths in which to look for the userspace driver libraries.
## Motivation
From a glance, the stumbling stones seem to be as follows:

- `ldconfig` is assumed to be aware of the userspace drivers' location (e.g. through a global `/etc/ld.so.conf`, which also may not exist);
- `/etc/ld.so.cache` is assumed to exist, but it's not guaranteed to; `ld.so.cache` is specific to glibc, and I'm not sure an equivalent concept even exists for musl; while it's reasonable to limit support to glibc (e.g. because NVIDIA only publishes binaries built against it), even systems that use glibc may not populate the global cache, so it's safer to treat it as an optional cache for speeding up the dynamic loader;
- the `ld.so.cache` format is assumed, which is a glibc internal and probably not part of its public interface; e.g. `ldcache.c` replicates the header structure: https://github.com/NVIDIA/libnvidia-container/blob/5c75904f9cf41bd106a0424e6d24c2854ef94c11/src/ldcache.c#L46-L53.
Inspecting the dynamic loader's search paths to infer the host system's libraries seems to be a valid need, and we should probably consult glibc's (and/or other libc implementations') maintainers on how to approach it correctly. The optional `/etc/ld.so.conf` is only one of the tunables that affect `ld.so`'s behaviour; others include e.g. `LD_PRELOAD`, `LD_LIBRARY_PATH`, and `DT_RUNPATH`. Rather than approximate just a part of the dynamic loader's behaviour, we should probably use the loader itself. The only "public" interfaces I'm currently aware of are `dlopen()`+`dlinfo()` (which allows code execution, albeit with the same privileges the parent process already has anyway) and `ld.so --list` (which requires a test ELF binary as an argument). I think a ticket in glibc's issue tracker would be a reasonable step forward.
Cf. also https://github.com/apptainer/apptainer/issues/1894, https://github.com/NixOS/nixpkgs/pull/279235, https://github.com/NVIDIA/nvidia-container-toolkit/issues/71
Thanks!