NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

failed to initalize NVML: ERROR_LIBRARY_NOT_FOUND #118

Open fifofonix opened 1 year ago

fifofonix commented 1 year ago

I'm experimenting with the nvidia-container-toolkit on Fedora CoreOS, specifically using podman to run GPU workloads.

The new v1.14 nvidia-container-toolkit and nvidia-container-toolkit-base packages now install fine with rpm-ostree, which is great.
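
For reference, the layering steps look roughly like this (the repo URL is the one from NVIDIA's package install docs and may change; a reboot is needed before the layered packages take effect):

$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
$ sudo rpm-ostree install nvidia-container-toolkit nvidia-container-toolkit-base
$ sudo systemctl reboot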

But running sudo nvidia-ctk --debug cdi generate to generate the CDI spec fails because it cannot locate the NVML shared library.

Inspecting /etc/ld.so.conf.d/libnvidia-container-tools-1.4.0-1.x86_64.conf, I see it points to libraries in /usr/local/lib.

The new generic rpms do not install there on Fedora CoreOS (because they can't).

I am running the driver container with the recommended shared mounts, so by manually editing this file to reference that location, i.e. replacing its contents with /run/nvidia/driver/usr/lib64, and then running ldconfig to reload the shared libraries, I'm able to complete CDI spec generation.
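
Concretely, the workaround is roughly the following (the .conf filename is whatever the rpm dropped into /etc/ld.so.conf.d; the library path is the driver container's shared mount):

$ echo "/run/nvidia/driver/usr/lib64" | sudo tee /etc/ld.so.conf.d/libnvidia-container-tools-1.4.0-1.x86_64.conf
$ sudo ldconfig
$ sudo nvidia-ctk --debug cdi generate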

It would seem that the generic rpm installation should somehow anticipate the location of the libraries, or perhaps list several potential folder locations, including the one used by the driver container.

klueska commented 1 year ago

If you are running with the driver container, then you need to point nvidia-ctk at the driver root, i.e.:

nvidia-ctk cdi generate --driver-root=/run/nvidia/driver
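
For example, to write the spec to the usual CDI location and consume it from podman, something along these lines should work (the output path and test image here are just illustrative):

$ sudo nvidia-ctk cdi generate --driver-root=/run/nvidia/driver --output=/etc/cdi/nvidia.yaml
$ podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi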

There are a number of other options available to help you navigate non-standard environments as well:

$ nvidia-ctk cdi generate --help
NAME:
   NVIDIA Container Toolkit CLI cdi generate - Generate CDI specifications for use with CDI-enabled runtimes

USAGE:
   NVIDIA Container Toolkit CLI cdi generate [command options] [arguments...]

OPTIONS:
   --output value                        Specify the file to output the generated CDI specification to. If this is '' the specification is output to STDOUT
   --format value                        The output format for the generated spec [json | yaml]. This overrides the format defined by the output file extension (if specified). (default: "yaml")
   --mode value, --discovery-mode value  The mode to use when discovering the available entities. One of [auto | nvml | wsl]. If mode is set to 'auto' the mode will be determined based on the system configuration. (default: "auto")
   --device-name-strategy value          Specify the strategy for generating device names. One of [index | uuid | type-index] (default: "index")
   --driver-root value                   Specify the NVIDIA GPU driver root to use when discovering the entities that should be included in the CDI specification.
   --library-search-path value           Specify the path to search for libraries when discovering the entities that should be included in the CDI specification.
                                         Note: This option only applies to CSV mode.
   --nvidia-ctk-path value               Specify the path to use for the nvidia-ctk in the generated CDI specification. If this is left empty, the path will be searched.
   --vendor value, --cdi-vendor value    the vendor string to use for the generated CDI specification. (default: "nvidia.com")
   --class value, --cdi-class value      the class string to use for the generated CDI specification. (default: "gpu")
   --csv.file value                      The path to the list of CSV files to use when generating the CDI specification in CSV mode. (default: "/etc/nvidia-container-runtime/host-files-for-container.d/devices.csv", "/etc/nvidia-container-runtime/host-files-for-container.d/drivers.csv", "/etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv")
   --help, -h                            show help (default: false)

fifofonix commented 1 year ago

So, unfortunately I'm still getting the same error even when passing --library-search-path and --driver-root.

I've also tried export LD_LIBRARY_PATH=/run/nvidia/driver/lib64 prior to the nvidia-ctk command.

It would seem that, on Fedora CoreOS at least, some manipulation of /etc/ld.so.conf.d is still required.

On a newly provisioned system that installs nvidia-container-toolkit via rpm-ostree and is running the driver container successfully (as witnessed by nvidia-smi output):

$ sudo nvidia-ctk --debug cdi generate --driver-root=/run/nvidia/driver --library-search-path /run/nvidia/driver/usr/lib64
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Checking candidate '/usr/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/usr/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path nvidia-ctk
DEBU[0000] Is WSL-based system? false: could not load DXCore library: libdxcore.so: cannot open shared object file: No such file or directory
DEBU[0000] Is NVML-based system? false: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
DEBU[0000] Is Tegra-based system? false: /sys/devices/soc0/family file not found
INFO[0000] Auto-detected mode as "nvml"
ERRO[0000] failed to generate CDI spec: failed to create device CDI specs: failed to initalize NVML: ERROR_LIBRARY_NOT_FOUND