Closed ccjincong closed 1 year ago
This is unlikely to be something that we can trivially help with as it looks like the LXD container itself is behaving as expected. So it's likely something that's going wrong with nvidia-docker for your nested container. We're obviously not Docker or NVIDIA experts so don't really know how to debug this further.
If it were me, I'd find a way to hook strace up to the failing process so you can get an idea of which system calls it makes are failing, whether that's down to bad file permissions or something being outright denied by the kernel.
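As a rough sketch of that strace suggestion (not a procedure from the LXD team): the helper names `trace_failures` and `show_errors` and the log path are hypothetical, and the PID has to be looked up by hand, for example with `pgrep`. It assumes strace is installed where the failing process runs and that you have permission to attach to it.

```shell
#!/bin/sh
# Attach strace to an already-running process and log its system calls.
# Usage: trace_failures <pid> [logfile]   (stop with Ctrl-C)
trace_failures() {
    pid="$1"
    out="${2:-/tmp/strace-$pid.log}"   # hypothetical default log location
    # -f follows child processes and threads, -p attaches to an existing PID,
    # -o writes the trace to a file instead of stderr.
    strace -f -o "$out" -p "$pid"
}

# Print only the calls that returned an error (EPERM, EACCES, ENOENT, ...).
# Usage: show_errors <logfile>
show_errors() {
    grep -E '= -1 E[A-Z]+' "$1"
}
```

A denied device open would then show up as a line such as an `openat` on a `/dev/nvidia*` node ending in `= -1 EPERM` or `= -1 EACCES`, which narrows down whether the problem is permissions or a kernel-level denial.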
I just tried disabling MIG, and it worked. I will turn to NVIDIA for help.
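For anyone checking the same thing, here is a minimal sketch of how the MIG state could be inspected on the host. The helper name `mig_enabled_gpus` is hypothetical, and the GPU index in the final comment is an assumption, not taken from this setup.

```shell
#!/bin/sh
# Print the index of every GPU whose current MIG mode is "Enabled".
# Expects CSV lines of the form "index, name, mig mode" on stdin.
mig_enabled_gpus() {
    awk -F', ' '$3 == "Enabled" { print $1 }'
}

# Only query the driver if nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv,noheader \
        | mig_enabled_gpus
fi

# Disabling MIG on one GPU (assumed index 0 here) needs root and may
# require a GPU reset or reboot before it takes effect:
#   nvidia-smi -i 0 -mig 0
```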
Required information
Issue description
I used LXD to create a container. Inside the LXD container I used the NVIDIA NGC PyTorch image
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
to create another (Docker) container. But when I run docker run --gpus all --name test1 -it --runtime=nvidia nvcr.io/nvidia/pytorch:23.05-py3
to get into the container, it shows the following, and I then call torch.cuda.is_available():
I ran nvidia-smi in the LXD container; it output:
I ran nvcc --version in the Docker container; it shows:
I ran nvidia-smi in the Docker container; it output:
I ran nvcc --version in the Docker container; it shows:
Information to attach
dmesg
lxc info NAME --show-log
Resources:
  Processes: 156
  Disk usage:
    root: 550.46GiB
  CPU usage:
    CPU usage (in seconds): 6233
  Memory usage:
    Memory (current): 2.87GiB
    Memory (peak): 3.36GiB
  Network usage:
    docker0:
      Type: broadcast
      State: UP
      MAC address: 02:42:99:a9:b4:65
      MTU: 1500
      Bytes received: 625.32kB
      Bytes sent: 32.49MB
      Packets received: 12895
      Packets sent: 21297
      IP addresses:
        inet: 172.17.0.1/16 (global)
        inet6: fe80::42:99ff:fea9:b465/64 (link)
    eth0:
      Type: broadcast
      State: UP
      Host interface: veth7570b1bc
      MAC address: 00:16:3e:99:81:19
      MTU: 1500
      Bytes received: 12.64GB
      Bytes sent: 410.08MB
      Packets received: 8441356
      Packets sent: 5625534
      IP addresses:
        inet: 10.124.188.222/24 (global)
        inet6: fd42:eb5f:d57b:3c13:216:3eff:fe99:8119/64 (global)
        inet6: fe80::216:3eff:fe99:8119/64 (link)
    lo:
      Type: loopback
      State: UP
      MTU: 65536
      Bytes received: 120.83kB
      Bytes sent: 120.83kB
      Packets received: 1018
      Packets sent: 1018
      IP addresses:
        inet: 127.0.0.1/8 (local)
        inet6: ::1/128 (local)
Log:
lxc chenjincong 20230607040824.928 ERROR conf - ../src/src/lxc/conf.c:turn_into_dependent_mounts:3948 - No such file or directory - Failed to recursively turn old root mount tree into dependent mount. Continuing...
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20230602_07:43)
  image.name: ubuntu-focal-amd64-default-20230602_07:43
  image.os: ubuntu
  image.release: focal
  image.serial: "20230602_07:43"
  image.variant: default
  security.nesting: "true"
  security.privileged: "true"
  security.syscalls.intercept.mknod: "true"
  security.syscalls.intercept.setxattr: "true"
  volatile.base_image: 88b26c8cd8737818c062f547b1f7cb472ed3dc82bd66bcf95779dff4ae6cc5c5
  volatile.cloud-init.instance-id: 217fc8fc-978e-4e85-a543-a408c4b9ca41
  volatile.eth0.host_name: veth7570b1bc
  volatile.eth0.hwaddr: 00:16:3e:99:81:19
  volatile.idmap.base: "0"
  volatile.idmap.current: '[]'
  volatile.idmap.next: '[]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: 053d9a82-df16-4fd3-8184-96d47cb1f799
  volatile.uuid.generation: 053d9a82-df16-4fd3-8184-96d47cb1f799
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  gpu0:
    gputype: physical
    pci: "40:00.0"
    type: gpu
  gpu1:
    gputype: physical
    pci: B1:00.0
    type: gpu
  root:
    path: /
    pool: zfs-pool
    size: 30TB
    type: disk
ephemeral: false
profiles: