Open fifofonix opened 1 year ago
If you are running with the driver container, then you need to point nvidia-ctk
at the driver root, i.e.:
nvidia-ctk cdi generate --driver-root=/run/nvidia/driver
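For reference, a minimal end-to-end sketch of that flow (the /etc/cdi output path and the podman invocation are illustrative assumptions, and assume a podman version with CDI support; adjust paths for your setup):

# Generate the spec against the driver container's root and write it where CDI-enabled runtimes look for it.
sudo nvidia-ctk cdi generate --driver-root=/run/nvidia/driver --output=/etc/cdi/nvidia.yaml

# Quick sanity check: request a CDI device from podman.
podman run --rm --device nvidia.com/gpu=all docker.io/library/ubuntu nvidia-smi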
There are a number of other options available to help you navigate non-standard environments as well:
$ nvidia-ctk cdi generate --help
NAME:
NVIDIA Container Toolkit CLI cdi generate - Generate CDI specifications for use with CDI-enabled runtimes
USAGE:
NVIDIA Container Toolkit CLI cdi generate [command options] [arguments...]
OPTIONS:
--output value Specify the file to output the generated CDI specification to. If this is '' the specification is output to STDOUT
--format value The output format for the generated spec [json | yaml]. This overrides the format defined by the output file extension (if specified). (default: "yaml")
--mode value, --discovery-mode value The mode to use when discovering the available entities. One of [auto | nvml | wsl]. If mode is set to 'auto' the mode will be determined based on the system configuration. (default: "auto")
--device-name-strategy value Specify the strategy for generating device names. One of [index | uuid | type-index] (default: "index")
--driver-root value Specify the NVIDIA GPU driver root to use when discovering the entities that should be included in the CDI specification.
--library-search-path value Specify the path to search for libraries when discovering the entities that should be included in the CDI specification.
Note: This option only applies to CSV mode.
--nvidia-ctk-path value Specify the path to use for the nvidia-ctk in the generated CDI specification. If this is left empty, the path will be searched.
--vendor value, --cdi-vendor value the vendor string to use for the generated CDI specification. (default: "nvidia.com")
--class value, --cdi-class value the class string to use for the generated CDI specification. (default: "gpu")
--csv.file value The path to the list of CSV files to use when generating the CDI specification in CSV mode. (default: "/etc/nvidia-container-runtime/host-files-for-container.d/devices.csv", "/etc/nvidia-container-runtime/host-files-for-container.d/drivers.csv", "/etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv")
--help, -h show help (default: false)
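For example, combining a few of these for a driver-container layout might look like the following (the output path is just an assumption; note the caveat above that --library-search-path only applies to CSV mode, so it is not expected to influence nvml-mode discovery):

sudo nvidia-ctk cdi generate \
    --driver-root=/run/nvidia/driver \
    --library-search-path=/run/nvidia/driver/usr/lib64 \
    --output=/etc/cdi/nvidia.yaml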
So, unfortunately I'm still getting the same error even when passing --library-search-path and --driver-root.
I've also tried export LD_LIBRARY_PATH=/run/nvidia/driver/lib64 prior to running the nvidia-ctk command.
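The sequence I tried was approximately the following (whether the export actually survives sudo's environment scrubbing of LD_* variables is a separate question):

export LD_LIBRARY_PATH=/run/nvidia/driver/lib64
sudo nvidia-ctk --debug cdi generate --driver-root=/run/nvidia/driver --library-search-path /run/nvidia/driver/usr/lib64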
It would seem that, on Fedora CoreOS at least, some manipulation of /etc/ld.so.conf.d is still required.
On a newly provisioned system that installs nvidia-container-toolkit via rpm-ostree and is running the driver container successfully, as witnessed by nvidia-smi output:
$ sudo nvidia-ctk --debug cdi generate --driver-root=/run/nvidia/driver --library-search-path /run/nvidia/driver/usr/lib64
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Checking candidate '/usr/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/usr/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path nvidia-ctk
DEBU[0000] Is WSL-based system? false: could not load DXCore library: libdxcore.so: cannot open shared object file: No such file or directory
DEBU[0000] Is NVML-based system? false: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
DEBU[0000] Is Tegra-based system? false: /sys/devices/soc0/family file not found
INFO[0000] Auto-detected mode as "nvml"
ERRO[0000] failed to generate CDI spec: failed to create device CDI specs: failed to initalize NVML: ERROR_LIBRARY_NOT_FOUND
I'm experimenting with nvidia-container-toolkit on Fedora CoreOS, specifically using podman to run GPU workloads. The new v1.14 nvidia-container-toolkit and nvidia-container-toolkit-base packages now install fine with rpm-ostree, which is great. But sudo nvidia-ctk --debug cdi generate, used to generate the CDI spec, fails due to an inability to locate the NVML shared library.

Inspecting /etc/ld.so.conf.d/libnvidia-container-tools-1.4.0-1.x86_64.conf, I see it points at libraries in /usr/local/lib. The new generic RPMs do not install there on Fedora CoreOS (because they can't). I am running the driver container with the recommended shared mounts, so by manually editing this file to reference the driver container's library location, i.e. replacing its contents with /run/nvidia/driver/usr/lib64, and then running ldconfig to reload the shared library cache, I'm able to complete CDI spec generation.

It would seem that the generic RPM installation should somehow anticipate the location of the libraries, or perhaps list several potential directories, including the one used by the driver container.
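For concreteness, the manual workaround amounts to something like this (the conf file name is as installed on my system and the library path is the driver container's default mount location, both of which may differ elsewhere; editing a packaged file is obviously not a supported configuration):

# Replace the packaged ld.so.conf.d entry so the dynamic linker finds the driver container's libraries.
echo /run/nvidia/driver/usr/lib64 | sudo tee /etc/ld.so.conf.d/libnvidia-container-tools-1.4.0-1.x86_64.conf
sudo ldconfig

# CDI spec generation then completes.
sudo nvidia-ctk --debug cdi generate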