Closed starry91 closed 1 year ago
You can get the MIG device by MIG-<MIG UUID> or MIG-GPU-<GPU UUID>/<GI ID>/<CI ID>. For example, MIG-90443af0-2fe6-57fb-86fe-54186d5a6581 and MIG-GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1/3/0 refer to the same MIG device 0:0.
Ref: https://github.com/XuehaiPan/nvitop/blob/v1.0.0/nvitop/api/device.py#L220-L240
# https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
# https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-visible-devices
# GPU UUID : `GPU-<GPU-UUID>`
# MIG UUID : `MIG-GPU-<GPU-UUID>/<GPU instance ID>/<compute instance ID>`
# MIG UUID (R470+): `MIG-<MIG-UUID>`
UUID_PATTERN = re.compile(
    r"""^                                                  # full match
    (?:(?P<MigMode>MIG)-)?                                 # prefix for MIG UUID
    (?:(?P<GpuUuid>GPU)-)?                                 # prefix for GPU UUID
    (?(MigMode)|(?(GpuUuid)|GPU-))                         # always have a prefix
    (?P<UUID>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})  # UUID for the GPU/MIG device in lower case
    # Suffix for MIG device while using GPU UUID with GPU instance (GI) ID and compute instance (CI) ID
    (?(MigMode)                                            # match only when the MIG prefix matches
        (?(GpuUuid)                                        # match only when provided with a GPU UUID
            /(?P<GpuInstanceId>\d+)                        # GI ID of the MIG device
            /(?P<ComputeInstanceId>\d+)                    # CI ID of the MIG device
        |)
    |)
    $""",                                                  # full match
    flags=re.VERBOSE,
)
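For reference, the two MIG name forms can also be matched in Go. This is a simplified sketch, not the pattern nvitop ships: Go's regexp package has no conditional groups, so the two MIG forms are spelled out as alternatives and the plain GPU-UUID case is skipped.

```go
package main

import (
	"fmt"
	"regexp"
)

// uuidPart matches the lower-case hex UUID used by both name forms.
const uuidPart = `[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}`

// migPattern accepts either MIG-<MIG UUID> (R470+ drivers) or
// MIG-GPU-<GPU UUID>/<GI ID>/<CI ID>.
var migPattern = regexp.MustCompile(
	`^MIG-(?:GPU-(` + uuidPart + `)/(\d+)/(\d+)|(` + uuidPart + `))$`)

func main() {
	for _, s := range []string{
		"MIG-90443af0-2fe6-57fb-86fe-54186d5a6581",
		"MIG-GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1/3/0",
	} {
		m := migPattern.FindStringSubmatch(s)
		if m == nil {
			fmt.Println("no match:", s)
		} else if m[4] != "" {
			fmt.Printf("MIG UUID form: uuid=%s\n", m[4])
		} else {
			fmt.Printf("GPU UUID form: gpu=%s gi=%s ci=%s\n", m[1], m[2], m[3])
		}
	}
}
```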
Thanks @XuehaiPan, but my question was more around whether there is a way to get the MIG UUID from (GPU UUID, GI ID), i.e., 90443af0-2fe6-57fb-86fe-54186d5a6581 from (GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1, 3)? (I am using R470+.)
FWIW, there is a GetMigDeviceHandleByIndex API, but that takes in the GI index and not the GI ID.
You cannot identify a MIG device without the CI ID, because a GPU instance can have multiple compute instances, each of which is a different MIG device. You can use:
migDevice, _ := nvml.DeviceGetHandleByUUID("MIG-GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1/3/0")
migUuid, _ := migDevice.GetUUID() // -> MIG-90443af0-2fe6-57fb-86fe-54186d5a6581
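In other words, even without knowing the MIG UUID up front, you can construct the lookup string from the components you do have and let NVML resolve the handle. A minimal sketch of the string-building step (migDeviceName is a hypothetical helper, not part of go-nvml):

```go
package main

import "fmt"

// migDeviceName builds the MIG-GPU-<GPU UUID>/<GI ID>/<CI ID> string that
// DeviceGetHandleByUUID accepts. gpuUuid is expected to already carry its
// "GPU-" prefix, as reported by nvidia-smi -L.
func migDeviceName(gpuUuid string, gi, ci int) string {
	return fmt.Sprintf("MIG-%s/%d/%d", gpuUuid, gi, ci)
}

func main() {
	fmt.Println(migDeviceName("GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1", 3, 0))
	// MIG-GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1/3/0
}
```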
$ nvidia-smi
Thu Feb 2 15:30:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01 Driver Version: 525.78.01 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:1D:00.0 Off | On |
| N/A 30C P0 31W / 250W | 45MiB / 40960MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 2 0 0 | 19MiB / 19968MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 3 0 1 | 13MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 10 0 3 | 6MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-c96ffde7-75ee-2f7e-255b-d34a594c752b)
MIG 3g.20gb Device 0: (UUID: MIG-418b3ff6-54a5-5bf6-915f-f5e43a266bc0)
MIG 2g.10gb Device 1: (UUID: MIG-53bd3992-6571-5d69-b5b3-0d9561284cc0)
MIG 1g.5gb Device 2: (UUID: MIG-0557e0ba-657b-5162-84eb-7c309eee8899)
MIG 1g.5gb Device 3: (UUID: MIG-4bb5a7a2-316f-5941-bc2b-9bea41260fb5)
In [1]: from nvitop import Device
In [2]: mig = Device("MIG-GPU-c96ffde7-75ee-2f7e-255b-d34a594c752b/2/0")
In [3]: mig
Out[3]: MigDevice(index=(0, 0), name="NVIDIA A100-PCIE-40GB MIG 3g.20gb", total_memory=19968MiB)
In [4]: mig.uuid()
Out[4]: 'MIG-418b3ff6-54a5-5bf6-915f-f5e43a266bc0'
While it should theoretically be possible to get a MIG device handle from the 3-tuple of (GPU device handle, GI, CI), there is unfortunately no direct NVML API for that. I've had an internal RFE open for this API for a while now, but no movement yet.
Thanks @XuehaiPan! That should help.
@klueska @XuehaiPan Is there a way to programmatically get all the (GI, CI) pairs for a GPU device by just knowing its GPU index? I tried looking for a way to iterate over the GPUs and then get the GPU instances for each, but I could not find any relevant API.
Here is an example of walking all of the GIs currently created on a device: https://github.com/NVIDIA/mig-parted/blob/main/internal/nvlib/mig/mig.go#L67
And likewise all of the CIs currently created in a GI: https://github.com/NVIDIA/mig-parted/blob/main/internal/nvlib/mig/mig.go#L95
And here is an example of their usage together: https://github.com/NVIDIA/mig-parted/blob/main/pkg/mig/config/config.go#L80
@klueska Does GetGpuInstances require some elevated permissions? I seem to get ERROR_NO_PERMISSION when trying it out. Is that expected? Following is the stdout and code I was running.
Output:
INFO[0000] found 2 devices on host
ERRO[0000] error getting GPU instances for profile '0': 4
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
ERRO[0000] error getting GPU instances for profile '1': 4
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
ERRO[0000] error getting GPU instances for profile '2': 4
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
ERRO[0000] error getting GPU instances for profile '3': 4
INFO[0000] MIG UUID -
ERRO[0000] error getting GPU instances for profile '4': 4
INFO[0000] MIG UUID -
ERRO[0000] error getting GPU instances for profile '7': 4
INFO[0000] MIG UUID -
Code:
deviceCount, err := dcgm.GetAllDeviceCount()
if err != nil {
	logrus.Fatal(err)
}
logrus.Infof("found %v devices on host", deviceCount)
for gpuIndex := 0; gpuIndex < int(deviceCount); gpuIndex++ {
	device, err := nvml.DeviceGetHandleByIndex(gpuIndex)
	if err != nvml.SUCCESS {
		logrus.Fatal(err)
	}
	gpuUuid, err := device.GetUUID()
	if err != nvml.SUCCESS {
		logrus.Fatal(err)
	}
	_ = gpuUuid // fetched but unused below; blank-assign so the snippet compiles
	for i := 0; i < nvml.GPU_INSTANCE_PROFILE_COUNT; i++ {
		giProfileInfo, ret := device.GetGpuInstanceProfileInfo(i)
		if ret == nvml.ERROR_NOT_SUPPORTED {
			continue
		}
		if ret == nvml.ERROR_INVALID_ARGUMENT {
			continue
		}
		if ret != nvml.SUCCESS {
			logrus.Errorf("error getting GPU instance profile info for '%v': %v", i, ret)
		}
		gis, ret := device.GetGpuInstances(&giProfileInfo)
		if ret != nvml.SUCCESS {
			logrus.Errorf("error getting GPU instances for profile '%v': %v", i, ret)
		}
		for _, gi := range gis {
			info, _ := gi.GetInfo()
			uuid, _ := info.Device.GetUUID()
			logrus.Infof("MIG UUID - %s", uuid)
		}
	}
}
@starry91 are you running the example in a container or with non-standard permissions? I believe a user needs access to the MIG monitor capability (/dev/nvidia-caps/nvidia-cap2) to list the MIG devices available. (@klueska please correct me if I'm wrong.)
Yes, it needs to be privileged: https://gitlab.com/nvidia/headers/cuda-individual/nvml_dev/-/blob/main/nvml.h#L8587
@klueska @elezar I am running it outside containers without any elevated permissions. Is there a way to get the information as a normal user (without elevated permissions)? I seem to get similar info from nvidia-smi, so I'm guessing there must be a way.
$ nvidia-smi
Mon Feb 6 08:46:30 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:31:00.0 Off | On |
| N/A 29C P0 42W / 300W | 39MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:B1:00.0 Off | On |
| N/A 38C P0 45W / 300W | 39MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 3 0 0 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 5 0 1 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 6 0 2 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 3 0 0 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 4 0 1 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 5 0 2 | 13MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$
$ nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1)
MIG 2g.20gb Device 0: (UUID: MIG-90443af0-2fe6-57fb-86fe-54186d5a6581)
MIG 2g.20gb Device 1: (UUID: MIG-0639aa14-ce99-56c6-a34a-354f5eb7d167)
MIG 2g.20gb Device 2: (UUID: MIG-e246d687-cb86-58e3-b88e-8222ac40255b)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-b0d3e82f-1fca-b00f-d2ea-8ee3940ebe64)
MIG 2g.20gb Device 0: (UUID: MIG-278a0afa-d742-502a-926a-f362a8aaa07e)
MIG 2g.20gb Device 1: (UUID: MIG-153c7049-3521-5d47-ae04-aa7cc45d208b)
MIG 2g.20gb Device 2: (UUID: MIG-30ee243d-b9bd-563c-b61b-f2e3ff3c9a13)
$
You can walk a GPU device to get all of its MIG devices and then call GetGPUInstance and GetComputeInstance against each MIG device.
Visit each mig device: https://gitlab.com/nvidia/cloud-native/go-nvlib/-/blob/main/pkg/nvlib/device/device.go#L100
Get GI given a mig device handle: https://gitlab.com/nvidia/headers/cuda-individual/nvml_dev/-/blob/main/nvml.h#L8937
Get CI given a mig device handle: https://gitlab.com/nvidia/headers/cuda-individual/nvml_dev/-/blob/main/nvml.h#L8958
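That walk can be turned into the (GI ID, CI ID) -> MIG UUID lookup asked about above. The NVML calls themselves need a GPU, so this sketch only shows the indexing step with hand-fed records: the migInfo struct and buildLookup helper are hypothetical names, and on real hardware the fields would come from calls like GetGpuInstanceId, GetComputeInstanceId, and GetUUID on each MIG device handle.

```go
package main

import "fmt"

// migInfo holds, for one MIG device of a given GPU, the values the walk
// would collect from NVML (hypothetical struct, not a go-nvml type).
type migInfo struct {
	GI, CI int
	UUID   string
}

// buildLookup indexes MIG UUIDs by (GI ID, CI ID) for one GPU; together
// with the parent GPU UUID this is the (GPU UUID, GI ID, CI ID) -> MIG UUID
// mapping discussed in this thread.
func buildLookup(migs []migInfo) map[[2]int]string {
	lookup := make(map[[2]int]string)
	for _, m := range migs {
		lookup[[2]int{m.GI, m.CI}] = m.UUID
	}
	return lookup
}

func main() {
	// Values taken from the `nvidia-smi -L` output above for GPU 0.
	lookup := buildLookup([]migInfo{
		{GI: 3, CI: 0, UUID: "MIG-90443af0-2fe6-57fb-86fe-54186d5a6581"},
		{GI: 5, CI: 0, UUID: "MIG-0639aa14-ce99-56c6-a34a-354f5eb7d167"},
		{GI: 6, CI: 0, UUID: "MIG-e246d687-cb86-58e3-b88e-8222ac40255b"},
	})
	fmt.Println(lookup[[2]int{3, 0}]) // MIG-90443af0-2fe6-57fb-86fe-54186d5a6581
}
```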
Thanks a lot @klueska! It is exactly what I needed.
Can this be closed?
Thanks for the support @klueska @XuehaiPan @elezar. Closing this since the query is resolved.
Is there a way to get the MIG GPU instance UUID from (GPU UUID, GI_ID)? For example, in the below case, I am looking to get MIG-90443af0-2fe6-57fb-86fe-54186d5a6581 from (GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1, 3).