NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-ctk returns undefined symbol: nvmlComputeInstanceDestroy #557

Open · davidshen84 opened this issue 1 week ago


Hi,

I pulled the 1.15.0 tag and built the source code using the Makefile, without any changes. The build completed successfully and produced all the binaries.

However, when I try to generate the config file with nvidia-ctk, I get the following error:

> sudo nvidia-ctk --quiet config --config-file=/etc/nvidia-container-runtime/config.toml --in-place
nvidia-ctk: symbol lookup error: nvidia-ctk: undefined symbol: nvmlComputeInstanceDestroy

Here is the nvidia-smi output:

Sun Jun 23 18:34:38 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050        Off |   00000000:01:00.0 Off |                  N/A |
| N/A   39C    P8             N/A / ERR!  |       0MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I know the NVIDIA driver version is a bit old. The latest release is 555, but on Gentoo the newest available driver is still in the 550 range.

I found https://github.com/NVIDIA/nvidia-container-toolkit/issues/49, which reports a similar error but with a different undefined symbol. I wonder whether the two are related.