canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

NVIDIA GPU assignment with `mig.gi`/`mig.ci` doesn't work #12723

Open greenmoss opened 10 months ago

greenmoss commented 10 months ago

Required information

Issue description

Starting a container with a device of type gpu fails on NVIDIA driver version 530. The nvidia hook script launches nvidia-container-cli with a --device flag in the old format, for example:

--device=MIG-GPU-932bfb2c-849a-2a22-1e84-9eff6bc02b39/1/0

However, with the latest NVIDIA driver, this flag format has changed to the MIG UUID only, for example:

--device=MIG-3adbed3a-d03d-549f-a016-ba1fd7359f23
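To make the difference between the two formats concrete, here is a minimal sketch that classifies a --device value by its shape. The helper name classify_device and its output labels are hypothetical and exist only for illustration; the patterns are taken from the two examples above.

```shell
# classify_device VALUE
# Prints "legacy-mig" for MIG-GPU-<gpu uuid>/<gi>/<ci>,
# "mig-uuid" for MIG-<mig uuid>, and "other" for anything else.
# The function name and labels are assumptions made for this sketch.
classify_device() {
    case "$1" in
        MIG-GPU-*/*/*) echo "legacy-mig" ;;
        MIG-*)         echo "mig-uuid" ;;
        *)             echo "other" ;;
    esac
}

classify_device "MIG-GPU-932bfb2c-849a-2a22-1e84-9eff6bc02b39/1/0"
classify_device "MIG-3adbed3a-d03d-549f-a016-ba1fd7359f23"
```

The order of the patterns matters: the legacy form also starts with MIG-, so it has to be checked first.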

References:

Steps to reproduce

  1. Set up an LXD server with nvidia MIG
  2. Verify you see MIG devices, e.g. from nvidia-smi -L
  3. Define an LXC container with lxc config nvidia.runtime: "true"
  4. Also make sure the nvidia utility output is logged, by setting raw.lxc to lxc.log.level=debug in the LXC config
  5. Add a gpu device to the LXC container with gputype: mig
  6. Start the container
  7. Look in the logs: lxc info --show-log
  8. Search for start failure: exec nvidia-container-cli
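The steps above can be sketched as a command sequence. This is a dry run that only prints the commands so they can be reviewed first. The helper name print_repro_steps, the ubuntu:22.04 image, the device name gpu0 and the use of mig.gi/mig.ci (which the steps do not spell out, but the issue title implies) are assumptions.

```shell
# print_repro_steps NAME PCI GI CI
# Prints (does not run) commands corresponding to the reproduction steps.
# Function name, image, and device name gpu0 are assumptions for this sketch.
print_repro_steps() {
    name="$1" pci="$2" gi="$3" ci="$4"
    cat <<EOF
nvidia-smi -L
lxc init ubuntu:22.04 ${name}
lxc config set ${name} nvidia.runtime true
lxc config set ${name} raw.lxc lxc.log.level=debug
lxc config device add ${name} gpu0 gpu gputype=mig mig.gi=${gi} mig.ci=${ci} pci=${pci}
lxc start ${name}
lxc info --show-log ${name} | grep 'exec nvidia-container-cli'
EOF
}

print_repro_steps nvidia-mig1 0000:21:00.0 5 0
```

Piping the output to sh would execute the steps; on an affected system the lxc start line is expected to fail.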

Information to attach

greenmoss commented 10 months ago

dmesg.log lxc.log lxc_config.txt lxd.log lxc-monitor.log

greenmoss commented 10 months ago

There's a script, lxc/hooks/nvidia, which executes using the device name. However, the device name is provided by something else before that script starts. Is this the source of the device name?

https://github.com/canonical/lxd/blob/38ae187ff3d62f80038d3297839a63e6b90dfed2/lxd/device/gpu_mig.go#L96

Notice we don't use the gpu.Nvidia.UUID at all anymore. Instead we need to use the MIG UUID, which is a completely different UUID. Note the difference:

# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-932bfb2c-849a-2a22-1e84-9eff6bc02b39)
  MIG 3g.40gb     Device  0: (UUID: MIG-0929581a-2be6-5dec-bf34-0954c6c529a6)
  MIG 3g.40gb     Device  1: (UUID: MIG-3adbed3a-d03d-549f-a016-ba1fd7359f23)

tomponline commented 9 months ago

@simondeziel do any of the testflinger machines have MIG cards? is this something you could take a look at?

simondeziel commented 9 months ago

@greenmoss IIRC, our MIG tests all passed on a system with 2x NVIDIA-A100 cards. Our test environment differs a bit from yours though:

I'll rerun the tests and see if I can reproduce your issue.

simondeziel commented 9 months ago

@greenmoss I can reproduce your issue when using mig.gi and mig.ci:

# lxc config device add nvidia-mig1 gpu0 gpu gputype=mig mig.ci=0 mig.gi=5 pci=0000:21:00.0

# lxc config show nvidia-mig1
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (daily) (20240115)
  image.label: daily
  image.os: ubuntu
  image.release: jammy
  image.serial: "20240115"
  image.type: squashfs
  image.version: "22.04"
  nvidia.runtime: "true"
  volatile.base_image: 392a4b10b54c6a2fa65b19eb7225862e0cd923d067629ddf1d4b53249645154f
  volatile.cloud-init.instance-id: e76c449a-a7be-4d76-8079-3c63d7689b98
  volatile.eth0.hwaddr: 00:16:3e:e5:2d:2f
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: d93a6f09-9d8f-4a46-a673-56d725ae5dd0
  volatile.uuid.generation: d93a6f09-9d8f-4a46-a673-56d725ae5dd0
devices:
  gpu0:
    gputype: mig
    mig.ci: "0"
    mig.gi: "5"
    pci: "0000:21:00.0"
    type: gpu
ephemeral: false
profiles:
- default
stateful: false
description: ""

# lxc start nvidia-mig1
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart nvidia-mig1 /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/nvidia-mig1/lxc.conf: exit status 1
Try `lxc info --show-log nvidia-mig1` for more info

It fails because of the bogus MIG-GPU prefix and the mig.gi/mig.ci values appended to the device name:

# lxc info --show-log nvidia-mig1 | grep hooks/nvidia
lxc nvidia-mig1 20240117193850.896 INFO     conf - ../src/src/lxc/conf.c:run_script_argv:341 - Executing script "/snap/lxd/current/lxc/hooks/nvidia" for container "nvidia-mig1"
lxc nvidia-mig1 20240117193850.910 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: mkdir: cannot create directory ‘/var/snap/lxd/common/lxd/storage-pools/default/containers/nvidia-mig1/hook’
lxc nvidia-mig1 20240117193850.910 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: : Permission denied
lxc nvidia-mig1 20240117193850.912 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: + exec nvidia-container-cli --user configure --no-cgroups --ldconfig=@/usr/sbin/ldconfig.real --device=MIG-GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0 --compute --utility --require= --require= /var/snap/lxd/common/lxc/
lxc nvidia-mig1 20240117193851.109 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: nvidia-container-cli.real: device error: MIG-GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0: unknown device
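To pull just the offending device string out of a long log, a small filter like the following may help. It is a sketch; the helper name extract_cli_devices is hypothetical, and it assumes the hook's traced command line appears in the log exactly as in the lines above.

```shell
# extract_cli_devices
# Reads `lxc info --show-log` output on stdin and prints each distinct value
# that was passed as --device= to nvidia-container-cli.
# The function name is an assumption made for this sketch.
extract_cli_devices() {
    sed -n 's/.*exec nvidia-container-cli .*--device=\([^ ]*\).*/\1/p' | sort -u
}

# Demo on the traced command taken from the log above.
echo '+ exec nvidia-container-cli --user configure --no-cgroups --ldconfig=@/usr/sbin/ldconfig.real --device=MIG-GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0 --compute --utility' \
    | extract_cli_devices
```

On a live system the input would come from `lxc info --show-log nvidia-mig1 | extract_cli_devices`.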

Referring to the same MIG device but using its mig.uuid works:

devices:
  gpu0:
    gputype: mig
    mig.uuid: MIG-5eb55142-9ba3-5122-b2c5-6945ea7dce1a
    pci: "0000:21:00.0"
    type: gpu

# nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341)
  MIG 1c.2g.10gb  Device  0: (UUID: MIG-5eb55142-9ba3-5122-b2c5-6945ea7dce1a)
  MIG 1c.2g.10gb  Device  1: (UUID: MIG-c0480516-1019-5980-af70-469f3ac79a1c)
  MIG 1g.5gb      Device  2: (UUID: MIG-00cf968c-3639-52a8-b47a-9aca6124e1bd)
  MIG 1g.5gb      Device  3: (UUID: MIG-12c4d833-f137-51dc-b1d3-bf0a15afb9e1)
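Since mig.uuid is the form that works, it can be handy to list the MIG UUIDs per physical GPU and then re-create the device with one of them. The sketch below parses `nvidia-smi -L` output in the layout shown above and prints (without running) the lxc commands to swap the device. Both helper names are hypothetical, and removing and re-adding the device is just one assumed way to switch from mig.gi/mig.ci to mig.uuid.

```shell
# list_mig_uuids
# Reads `nvidia-smi -L` output on stdin and prints one line per MIG device:
#   <parent GPU UUID> <MIG device index> <MIG UUID>
list_mig_uuids() {
    awk '
        /^GPU [0-9]+:/ {
            # Remember the UUID of the physical GPU owning the MIG lines below.
            gpu = $0
            sub(/.*\(UUID: /, "", gpu)
            sub(/\).*/, "", gpu)
        }
        /^[ \t]*MIG / {
            mig = $0
            sub(/.*\(UUID: /, "", mig)
            sub(/\).*/, "", mig)
            idx = $0
            sub(/.*Device[ \t]+/, "", idx)
            sub(/:.*/, "", idx)
            print gpu, idx, mig
        }
    '
}

# print_mig_uuid_device_cmds INSTANCE DEVICE PCI MIG_UUID
# Prints (does not run) commands that replace the device with a mig.uuid one.
print_mig_uuid_device_cmds() {
    echo "lxc config device remove $1 $2"
    echo "lxc config device add $1 $2 gpu gputype=mig mig.uuid=$4 pci=$3"
}

# Demo on the first lines of the output shown above.
list_mig_uuids <<'EOF'
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341)
  MIG 1c.2g.10gb  Device  0: (UUID: MIG-5eb55142-9ba3-5122-b2c5-6945ea7dce1a)
  MIG 1c.2g.10gb  Device  1: (UUID: MIG-c0480516-1019-5980-af70-469f3ac79a1c)
EOF
print_mig_uuid_device_cmds nvidia-mig1 gpu0 0000:21:00.0 MIG-5eb55142-9ba3-5122-b2c5-6945ea7dce1a
```

Note that the device index printed by nvidia-smi -L is not the same thing as the mig.gi value, so this listing does not by itself map a mig.gi/mig.ci pair to a MIG UUID.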

For this GPU, the 535 version is recommended:

# ubuntu-drivers devices
ERROR:root:aplay command not found
== /sys/devices/pci0000:20/0000:20:03.1/0000:21:00.0 ==
modalias : pci:v000010DEd000020F1sv000010DEsd0000145Fbc03sc02i00
vendor   : NVIDIA Corporation
model    : GA100 [A100 PCIe 40GB]
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-525 - distro non-free
driver   : nvidia-driver-535-open - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-525-open - distro non-free
driver   : nvidia-driver-535 - distro non-free recommended
driver   : nvidia-driver-535-server-open - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

I also tried the 525 driver; mig.gi and mig.ci device assignment doesn't work there either:

lxc nvidia-mig1 20240117200208.478 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: + exec nvidia-container-cli --user configure --no-cgroups --ldconfig=@/usr/sbin/ldconfig.real --device=MIG-GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0 --compute --utility --require= --require= /var/snap/lxd/common/lxc/

lxc nvidia-mig1 20240117200208.627 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: nvidia-container-cli.real: device error: MIG-GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0: unknown device

lxc nvidia-mig1 20240117200208.639 ERROR    conf - ../src/src/lxc/conf.c:run_buffer:322 - Script exited with status 1

# nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341)
  MIG 1c.2g.10gb  Device  0: (UUID: MIG-5eb55142-9ba3-5122-b2c5-6945ea7dce1a)
  MIG 1c.2g.10gb  Device  1: (UUID: MIG-c0480516-1019-5980-af70-469f3ac79a1c)
  MIG 1g.5gb      Device  2: (UUID: MIG-00cf968c-3639-52a8-b47a-9aca6124e1bd)
  MIG 1g.5gb      Device  3: (UUID: MIG-12c4d833-f137-51dc-b1d3-bf0a15afb9e1)

I assumed it was due to the bogus MIG- prefix added by LXD so I overrode the /snap/lxd/current/lxc/hooks/nvidia script:

# diff -Naur /snap/lxd/current/lxc/hooks/nvidia /root/nvidia
--- /snap/lxd/current/lxc/hooks/nvidia  2024-01-17 18:54:38.000000000 +0000
+++ /root/nvidia    2024-01-17 20:14:45.914535184 +0000
@@ -244,6 +244,7 @@
 fi

 if [ -n "${CLI_DEVICES}" ] && [ "${CLI_DEVICES}" != "none" ]; then
+    CLI_DEVICES="$(echo "${CLI_DEVICES}" | sed 's/MIG-GPU/GPU/g')"
     configure_args+=(--device="${CLI_DEVICES}")
 fi

# mount -o bind,ro ~/nvidia /snap/lxd/current/lxc/hooks/nvidia

But that still didn't help:

lxc nvidia-mig1 20240117201503.241 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: + exec nvidia-container-cli --user configure --no-cgroups --ldconfig=@/usr/sbin/ldconfig.real --device=GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0 --compute --utility --require= --require= /var/snap/lxd/common/lxc/

lxc nvidia-mig1 20240117201503.391 DEBUG    conf - ../src/src/lxc/conf.c:run_buffer:311 - Script exec /snap/lxd/current/lxc/hooks/nvidia produced output: nvidia-container-cli.real: device error: GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0: unknown device
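A quick way to see what the patched hook actually changed: the substitution only drops the MIG- prefix, so the value handed to nvidia-container-cli is still a GPU UUID followed by /gi/ci rather than a MIG UUID. The sketch below applies the same sed expression and splits the result into its parts. The helper names rewrite_device and split_legacy_device are hypothetical.

```shell
# rewrite_device VALUE
# Applies the same substitution as the patched hook above.
rewrite_device() {
    echo "$1" | sed 's/MIG-GPU/GPU/g'
}

# split_legacy_device VALUE
# Splits [MIG-]<gpu uuid>/<gi>/<ci> into its three parts.
# Runs in a subshell so the temporary variables do not leak.
split_legacy_device() (
    dev="${1#MIG-}"      # drop the MIG- prefix if present
    rest="${dev#*/}"     # everything after the first slash: <gi>/<ci>
    echo "gpu=${dev%%/*} gi=${rest%%/*} ci=${rest#*/}"
)

rewritten="$(rewrite_device 'MIG-GPU-2a956425-8ac7-bb38-eb07-be255e4fe341/5/0')"
echo "$rewritten"
split_legacy_device "$rewritten"
```

Both the original and the rewritten value split into the same GPU UUID, gi and ci, which matches the observation that nvidia-container-cli rejects either spelling as an unknown device.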

@tomponline, based on https://github.com/NVIDIA/nvidia-container-toolkit/issues/203#issuecomment-1882768226 and https://github.com/canonical/lxd-ci/commit/05b3d2c3d3f02508b4318bbdd067da47dc4523fd maybe we should simply retire/remove support for mig.gi and mig.ci?