Open greenmoss opened 10 months ago
There's a script lxc/hooks/nvidia which executes using the device name. However, the device name is provided by something else before that script starts. Is this the source of the device name?
Notice we don't use the gpu.Nvidia.UUID at all anymore. Instead we need to use the MIG UUID, which is a completely different UUID. Note the difference:
# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-932bfb2c-849a-2a22-1e84-9eff6bc02b39)
MIG 3g.40gb Device 0: (UUID: MIG-0929581a-2be6-5dec-bf34-0954c6c529a6)
MIG 3g.40gb Device 1: (UUID: MIG-3adbed3a-d03d-549f-a016-ba1fd7359f23)
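Since the GPU UUID and the MIG UUIDs are unrelated values, it can help to extract the MIG UUIDs from the listing mechanically rather than by eye. A minimal sketch, assuming the `nvidia-smi -L` output format shown above; `list_mig_uuids` is a hypothetical helper name and not part of LXD or the NVIDIA tooling:

```shell
#!/bin/sh
# Hypothetical helper: print the MIG UUIDs found in `nvidia-smi -L`
# output supplied on stdin, one per line.
list_mig_uuids() {
    # MIG lines look like: "MIG 3g.40gb Device 0: (UUID: MIG-...)".
    # Lines for the parent GPU start with "GPU" and are skipped.
    sed -n 's/^[[:space:]]*MIG .*(UUID: \(MIG-[0-9a-f-]*\)).*$/\1/p'
}

# Usage on a host with MIG enabled (not run here):
#   nvidia-smi -L | list_mig_uuids
```

Reading from stdin keeps the function usable on saved output as well as on a live host.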
@simondeziel do any of the testflinger machines have MIG cards? Is this something you could take a look at?
@greenmoss IIRC, our MIG tests all passed on a system with 2x NVIDIA-A100 cards. Our test environment differs a bit from yours though:
I'll rerun the tests and see if I can reproduce your issue.
@greenmoss I can reproduce your issue when using mig.gi and mig.ci:
# lxc config device add nvidia-mig1 gpu0 gpu gputype=mig mig.ci=0 mig.gi=5 pci=0000:21:00.0
# lxc config show nvidia-mig1
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (daily) (20240115)
  image.label: daily
  image.os: ubuntu
  image.release: jammy
  image.serial: "20240115"
  image.type: squashfs
  image.version: "22.04"
  nvidia.runtime: "true"
  volatile.base_image: 392a4b10b54c6a2fa65b19eb7225862e0cd923d067629ddf1d4b53249645154f
  volatile.cloud-init.instance-id: e76c449a-a7be-4d76-8079-3c63d7689b98
  volatile.eth0.hwaddr: 00:16:3e:e5:2d:2f
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: d93a6f09-9d8f-4a46-a673-56d725ae5dd0
  volatile.uuid.generation: d93a6f09-9d8f-4a46-a673-56d725ae5dd0
devices:
  gpu0:
    gputype: mig
    mig.ci: "0"
    mig.gi: "5"
    pci: "0000:21:00.0"
    type: gpu
ephemeral: false
profiles:
- default
stateful: false
description: ""
# lxc start nvidia-mig1
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart nvidia-mig1 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/nvidia-mig1/lxc.conf: exit status 1
Try `lxc info --show-log nvidia-mig1` for more info
It fails due to the bogus MIG-GPU prefix and the mig.gi/mig.ci bits at the end of the device:
# lxc info --show-log nvidia-mig1 | grep hooks/nvidia
lxc nvidia-mig1 20240117193850.896 INFO conf - ../src/src/lxc/conf.c:run_script_argv:341 - Executing script "/snap/lxd/current/lxc/hooks/nvidia" for container "nvidia-mig1"
lxc nvidia-mig1 20240117193850.910 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: mkdir: cannot create directory ‘/var/snap/lxd/common/lxd/storage-pools/default/containers/nvidia-mig1/hook’
lxc nvidia-mig1 20240117193850.910 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: : Permission denied
lxc nvidia-mig1 20240117193850.912 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: + exec nvidia-container-cli --user configure --no-cgroups --ldconfig=@/usr/sbin/ldconfig.real --device=MIG-GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0 --compute --utility --require= --require= /var/snap/lxd/common/lxc/
lxc nvidia-mig1 20240117193851.109 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: nvidia-container-cli.real: device error: MIG-GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0: unknown device
Referring to the same MIG device but using its mig.uuid works:
devices:
  gpu0:
    gputype: mig
    mig.uuid: MIG-5eb55142-9ba3-5122-b2c5-6945ea7dce1a
    pci: "0000:21:00.0"
    type: gpu
# nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341)
MIG 1c.2g.10gb Device 0: (UUID: MIG-5eb55142-9ba3-5122-b2c5-6945ea7dce1a)
MIG 1c.2g.10gb Device 1: (UUID: MIG-c0480516-1019-5980-af70-469f3ac79a1c)
MIG 1g.5gb Device 2: (UUID: MIG-00cf968c-3639-52a8-b47a-9aca6124e1bd)
MIG 1g.5gb Device 3: (UUID: MIG-12c4d833-f137-51dc-b1d3-bf0a15afb9e1)
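For reference, the working configuration can be expressed with the same `lxc config device add` form used earlier, swapping mig.gi/mig.ci for mig.uuid. The sketch below only prints the command instead of running it; `mig_device_add_cmd` is a hypothetical helper, and the device name gpu0 is simply carried over from the example above:

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the command that attaches a
# MIG device to an instance by its MIG UUID.
# Arguments: instance name, PCI address, MIG UUID.
mig_device_add_cmd() {
    instance="$1"
    pci="$2"
    mig_uuid="$3"
    case "$mig_uuid" in
        MIG-GPU-*)
            # The composed "MIG-GPU-<uuid>/<gi>/<ci>" form is the one rejected above.
            echo "expected a MIG UUID, got a GPU-based name: $mig_uuid" >&2
            return 1
            ;;
        MIG-*)
            ;;
        *)
            echo "expected a MIG-... UUID, got: $mig_uuid" >&2
            return 1
            ;;
    esac
    printf 'lxc config device add %s gpu0 gpu gputype=mig mig.uuid=%s pci=%s\n' \
        "$instance" "$mig_uuid" "$pci"
}

# Example:
#   mig_device_add_cmd nvidia-mig1 0000:21:00.0 MIG-5eb55142-9ba3-5122-b2c5-6945ea7dce1a
```

Rejecting anything that is not a plain MIG UUID up front avoids passing a GPU UUID or a composed GPU/gi/ci name through to the hook, which is what produced the "unknown device" errors.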
For this GPU, the 535 version is recommended:
# ubuntu-drivers devices
ERROR:root:aplay command not found
== /sys/devices/pci0000:20/0000:20:03.1/0000:21:00.0 ==
modalias : pci:v000010DEd000020F1sv000010DEsd0000145Fbc03sc02i00
vendor : NVIDIA Corporation
model : GA100 [A100 PCIe 40GB]
driver : nvidia-driver-535-server - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-525-server - distro non-free
driver : nvidia-driver-525 - distro non-free
driver : nvidia-driver-535-open - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-525-open - distro non-free
driver : nvidia-driver-535 - distro non-free recommended
driver : nvidia-driver-535-server-open - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
I tried with the 525 driver; the mig.gi and mig.ci device assignment doesn't work there either:
lxc nvidia-mig1 20240117200208.478 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: + exec nvidia-container-cli --user configure --no-cgroups --ldconfig=@/usr/sbin/ldconfig.real --device=MIG-GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0 --compute --utility --require= --require= /var/snap/lxd/common/lxc/
lxc nvidia-mig1 20240117200208.627 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: nvidia-container-cli.real: device error: MIG-GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0: unknown device
lxc nvidia-mig1 20240117200208.639 ERROR conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 1
# nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341)
MIG 1c.2g.10gb Device 0: (UUID: MIG-5eb55142-9ba3-5122-b2c5-6945ea7dce1a)
MIG 1c.2g.10gb Device 1: (UUID: MIG-c0480516-1019-5980-af70-469f3ac79a1c)
MIG 1g.5gb Device 2: (UUID: MIG-00cf968c-3639-52a8-b47a-9aca6124e1bd)
MIG 1g.5gb Device 3: (UUID: MIG-12c4d833-f137-51dc-b1d3-bf0a15afb9e1)
I assumed it was due to the bogus MIG- prefix added by LXD, so I overrode the /snap/lxd/current/lxc/hooks/nvidia script:
# diff -Naur /snap/lxd/current/lxc/hooks/nvidia /root/nvidia
--- /snap/lxd/current/lxc/hooks/nvidia 2024-01-17 18:54:38.000000000 +0000
+++ /root/nvidia 2024-01-17 20:14:45.914535184 +0000
@@ -244,6 +244,7 @@
fi
if [ -n "${CLI_DEVICES}" ] && [ "${CLI_DEVICES}" != "none" ]; then
+ CLI_DEVICES="$(echo "${CLI_DEVICES}" | sed 's/MIG-GPU/GPU/g')"
configure_args+=(--device="${CLI_DEVICES}")
fi
# mount -o bind,ro ~/nvidia /snap/lxd/current/lxc/hooks/nvidia
But that still didn't help:
lxc nvidia-mig1 20240117201503.241 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: + exec nvidia-container-cli --user configure --no-cgroups --ldconfig=@/usr/sbin/ldconfig.real --device=GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0 --compute --utility --require= --require= /var/snap/lxd/common/lxc/
lxc nvidia-mig1 20240117201503.391 DEBUG conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: nvidia-container-cli.real: device error: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0: unknown device
@tomponline, based on https://github.com/NVIDIA/nvidia-container-toolkit/issues/203#issuecomment-1882768226 and https://github.com/canonical/lxd-ci/commit/05b3d2c3d3f02508b4318bbdd067da47dc4523fd maybe we should simply retire/remove support for mig.gi and mig.ci?
Required information
Issue description
Starting up a container with device type gpu on NVIDIA driver version 530 fails. The nvidia script attempts to launch using the --device flag, for example:
However, in the latest NVIDIA driver, this flag format has changed to MIG UUID only, for example:
References:
Steps to reproduce
nvidia-smi -L
nvidia.runtime: "true"
nvidia utility output, via LXC config raw.lxc | lxc.log.level=debug
gputype: mig
lxc info --show-log
exec nvidia-container-cli
Information to attach
dmesg
lxc info NAME --show-log
lxc config show NAME --expanded
lxc monitor (while reproducing the issue)