canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

CUDA failed to initialize. #11804

Closed: ccjincong closed this issue 1 year ago

ccjincong commented 1 year ago

Required information

Issue description

I used LXD to create a container. Inside the LXD container I used the NVIDIA NGC PyTorch image (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) to create another container. But when I run docker run --gpus all --name test1 -it --runtime=nvidia nvcr.io/nvidia/pytorch:23.05-py3 and enter the container, it shows:


ERROR: The NVIDIA Driver is present, but CUDA failed to initialize.  GPU functionality will not be available.
   [[ Initialization error (error 3) ]]

and torch.cuda.is_available() returns False:

>>> torch.cuda.is_available()
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:115: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
>>>

I ran nvidia-smi in the LXD container, and it output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           Off| 00000000:40:00.0 Off |                   On |
| N/A   27C    P0               31W / 400W|      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB           Off| 00000000:B1:00.0 Off |                   On |
| N/A   26C    P0               36W / 400W|      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG|
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I ran nvcc --version in the Docker container, and it showed:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

I ran nvidia-smi in the Docker container, and it output:

Wed Jun  7 12:17:34 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           Off| 00000000:40:00.0 Off |                   On |
| N/A   27C    P0               31W / 400W|      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB           Off| 00000000:B1:00.0 Off |                   On |
| N/A   26C    P0               36W / 400W|      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG|
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I ran nvcc --version in the Docker container, and it showed:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Besides, I run this Docker container inside an LXC container.

Information to attach

Resources:
  Processes: 156
  Disk usage:
    root: 550.46GiB
  CPU usage:
    CPU usage (in seconds): 6233
  Memory usage:
    Memory (current): 2.87GiB
    Memory (peak): 3.36GiB
  Network usage:
    docker0:
      Type: broadcast
      State: UP
      MAC address: 02:42:99:a9:b4:65
      MTU: 1500
      Bytes received: 625.32kB
      Bytes sent: 32.49MB
      Packets received: 12895
      Packets sent: 21297
      IP addresses:
        inet: 172.17.0.1/16 (global)
        inet6: fe80::42:99ff:fea9:b465/64 (link)
    eth0:
      Type: broadcast
      State: UP
      Host interface: veth7570b1bc
      MAC address: 00:16:3e:99:81:19
      MTU: 1500
      Bytes received: 12.64GB
      Bytes sent: 410.08MB
      Packets received: 8441356
      Packets sent: 5625534
      IP addresses:
        inet: 10.124.188.222/24 (global)
        inet6: fd42:eb5f:d57b:3c13:216:3eff:fe99:8119/64 (global)
        inet6: fe80::216:3eff:fe99:8119/64 (link)
    lo:
      Type: loopback
      State: UP
      MTU: 65536
      Bytes received: 120.83kB
      Bytes sent: 120.83kB
      Packets received: 1018
      Packets sent: 1018
      IP addresses:
        inet: 127.0.0.1/8 (local)
        inet6: ::1/128 (local)

Log:

lxc chenjincong 20230607040824.928 ERROR conf - ../src/src/lxc/conf.c:turn_into_dependent_mounts:3948 - No such file or directory - Failed to recursively turn old root mount tree into dependent mount. Continuing...

 - [ ] Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)

architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20230602_07:43)
  image.name: ubuntu-focal-amd64-default-20230602_07:43
  image.os: ubuntu
  image.release: focal
  image.serial: "20230602_07:43"
  image.variant: default
  security.nesting: "true"
  security.privileged: "true"
  security.syscalls.intercept.mknod: "true"
  security.syscalls.intercept.setxattr: "true"
  volatile.base_image: 88b26c8cd8737818c062f547b1f7cb472ed3dc82bd66bcf95779dff4ae6cc5c5
  volatile.cloud-init.instance-id: 217fc8fc-978e-4e85-a543-a408c4b9ca41
  volatile.eth0.host_name: veth7570b1bc
  volatile.eth0.hwaddr: 00:16:3e:99:81:19
  volatile.idmap.base: "0"
  volatile.idmap.current: '[]'
  volatile.idmap.next: '[]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 053d9a82-df16-4fd3-8184-96d47cb1f799
  volatile.uuid.generation: 053d9a82-df16-4fd3-8184-96d47cb1f799
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  gpu0:
    gputype: physical
    pci: "40:00.0"
    type: gpu
  gpu1:
    gputype: physical
    pci: B1:00.0
    type: gpu
  root:
    path: /
    pool: zfs-pool
    size: 30TB
    type: disk
ephemeral: false
profiles:

stgraber commented 1 year ago

This is unlikely to be something that we can trivially help with as it looks like the LXD container itself is behaving as expected. So it's likely something that's going wrong with nvidia-docker for your nested container. We're obviously not Docker or NVIDIA experts so don't really know how to debug this further.

If it were me, I'd find a way to hook strace up to the failing process so you can get an idea of any kernel calls it's making which may be failing, whether it's bad file permissions or something being outright denied by the kernel.
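
For anyone following the strace suggestion, one way to capture a trace would be something like strace -f -o trace.log python -c "import torch; torch.cuda.is_available()" inside the Docker container (assuming strace is installed there). The sketch below is not from this thread: it is a small Python filter that pulls failed syscalls with permission-style errors out of such a trace. The helper name failed_syscalls and the sample trace lines are made up for illustration and are not real output from this system.

```python
import re

# Matches one failed-syscall line of strace output, for example:
#   [pid 4242] openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = -1 EPERM (Operation not permitted)
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"        # optional pid prefix added by -f
    r"(?P<syscall>\w+)\((?P<args>.*)\)"      # syscall name and raw argument text
    r"\s+=\s+-1\s+(?P<errno>[A-Z]+)"         # failing return value and errno name
    r"\s+\((?P<msg>[^)]*)\)\s*$"             # human-readable errno message
)


def failed_syscalls(trace_text, errnos=("EACCES", "EPERM", "ENOENT")):
    """Return (syscall, errno, args) for each failed call whose errno is in errnos."""
    hits = []
    for line in trace_text.splitlines():
        match = FAILED_CALL.match(line.strip())
        if match and match.group("errno") in errnos:
            hits.append(
                (match.group("syscall"), match.group("errno"), match.group("args"))
            )
    return hits


# Hypothetical trace lines, invented for this example only.
SAMPLE_TRACE = '''\
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3
[pid 4242] openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = -1 EPERM (Operation not permitted)
[pid 4242] ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffd0) = 0
4243 stat("/proc/driver/nvidia/capabilities", 0x7ffd1) = -1 ENOENT (No such file or directory)
'''

if __name__ == "__main__":
    for syscall, errno, args in failed_syscalls(SAMPLE_TRACE):
        print(f"{syscall} -> {errno}: {args}")
```

To use it on a real trace, read trace.log into a string and pass it to failed_syscalls. Any EPERM or EACCES hits on /dev/nvidia* paths would be the first thing to look at.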

ccjincong commented 1 year ago

I just tried disabling MIG, and it worked. I will turn to NVIDIA for help.
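
A possible explanation, not confirmed anywhere in this thread: both nvidia-smi tables above show MIG M. as Enabled together with "No MIG devices found". With MIG mode on and no GPU instances created, CUDA may have no device it can enumerate, which would match the initialization error. The sketch below checks for that state by parsing the output of nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader. The helper names and the sample text are assumptions made for illustration.

```python
import subprocess

# nvidia-smi query that prints one "index, mode" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,mig.mode.current",
    "--format=csv,noheader",
]


def parse_mig_modes(csv_text):
    """Map GPU index -> MIG mode string from 'index, mode' CSV lines."""
    modes = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, mode = line.split(",", 1)
        modes[int(index.strip())] = mode.strip()
    return modes


def gpus_with_mig_enabled(csv_text):
    """Return the indices of GPUs whose current MIG mode is Enabled."""
    return [i for i, mode in parse_mig_modes(csv_text).items() if mode == "Enabled"]


# Sample text mirroring the two A100s in this issue (both MIG-enabled).
SAMPLE = "0, Enabled\n1, Enabled\n"

if __name__ == "__main__":
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        output = result.stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        # Fall back to the sample when nvidia-smi is missing or fails.
        output = SAMPLE
    print("GPUs with MIG enabled:", gpus_with_mig_enabled(output))
```

If a GPU is listed here and no MIG instances exist, MIG can be turned off per GPU with nvidia-smi -i <index> -mig 0 on the host. This typically needs root and may require a GPU reset before it takes effect.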