NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)
Apache License 2.0

Question regarding MIG UUID mapping #57

Closed starry91 closed 1 year ago

starry91 commented 1 year ago

Is there a way to get the MIG GPU instance UUID from (GPU UUID, GI ID)? For example, in the case below, I am looking to get MIG-90443af0-2fe6-57fb-86fe-54186d5a6581 from (GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1, 3).

$ nvidia-smi
Thu Feb  2 00:16:11 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:31:00.0 Off |                   On |
| N/A   30C    P0    42W / 300W |     26MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:B1:00.0 Off |                   On |
| N/A   38C    P0    45W / 300W |     26MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    5   0   1  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    3   0   0  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    5   0   1  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1)
  MIG 2g.20gb     Device  0: (UUID: MIG-90443af0-2fe6-57fb-86fe-54186d5a6581)
  MIG 2g.20gb     Device  1: (UUID: MIG-0639aa14-ce99-56c6-a34a-354f5eb7d167)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-b0d3e82f-1fca-b00f-d2ea-8ee3940ebe64)
  MIG 2g.20gb     Device  0: (UUID: MIG-278a0afa-d742-502a-926a-f362a8aaa07e)
  MIG 2g.20gb     Device  1: (UUID: MIG-30ee243d-b9bd-563c-b61b-f2e3ff3c9a13)
$
XuehaiPan commented 1 year ago

You can get the MIG device by MIG-<MIG UUID> or MIG-GPU-<GPU UUID>/<GI ID>/<CI ID>. For example:

MIG-90443af0-2fe6-57fb-86fe-54186d5a6581
MIG-GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1/3/0

both refer to the same MIG device 0:0.

Ref: https://github.com/XuehaiPan/nvitop/blob/v1.0.0/nvitop/api/device.py#L220-L240

# https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
# https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-visible-devices
# GPU UUID        : `GPU-<GPU-UUID>`
# MIG UUID        : `MIG-GPU-<GPU-UUID>/<GPU instance ID>/<compute instance ID>`
# MIG UUID (R470+): `MIG-<MIG-UUID>`
UUID_PATTERN = re.compile(
    r"""^  # full match
    (?:(?P<MigMode>MIG)-)?                                 # prefix for MIG UUID
    (?:(?P<GpuUuid>GPU)-)?                                 # prefix for GPU UUID
    (?(MigMode)|(?(GpuUuid)|GPU-))                         # always have a prefix
    (?P<UUID>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})  # UUID for the GPU/MIG device in lower case
    # Suffix for MIG device while using GPU UUID with GPU instance (GI) ID and compute instance (CI) ID
    (?(MigMode)                                            # match only when the MIG prefix matches
        (?(GpuUuid)                                        # match only when provide with GPU UUID
            /(?P<GpuInstanceId>\d+)                        # GI ID of the MIG device
            /(?P<ComputeInstanceId>\d+)                    # CI ID of the MIG device
        |)
    |)
    $""",  # full match
    flags=re.VERBOSE,
)
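For reference, here is a rough Go translation of the same three identifier forms. Go's regexp engine (RE2) has no conditional groups like Python's `(?(name)...)`, so the forms are spelled out as explicit alternatives; the `parse` function and the group names are illustrative, not part of go-nvml:

```go
package main

import (
	"fmt"
	"regexp"
)

// Accepted identifier forms (see the NVIDIA docs linked above):
//   GPU UUID        : GPU-<uuid>
//   MIG UUID        : MIG-GPU-<uuid>/<GPU instance ID>/<compute instance ID>
//   MIG UUID (R470+): MIG-<uuid>
var uuidPattern = regexp.MustCompile(
	`^(?:` +
		`GPU-(?P<GpuUuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})` +
		`|MIG-GPU-(?P<MigGpuUuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})/(?P<GpuInstanceId>\d+)/(?P<ComputeInstanceId>\d+)` +
		`|MIG-(?P<MigUuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})` +
		`)$`)

// parse returns the named groups that matched, or nil if the string is not
// one of the three forms.
func parse(s string) map[string]string {
	m := uuidPattern.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	out := map[string]string{}
	for i, name := range uuidPattern.SubexpNames() {
		if name != "" && m[i] != "" {
			out[name] = m[i]
		}
	}
	return out
}

func main() {
	p := parse("MIG-GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1/3/0")
	fmt.Printf("GPU UUID: %s, GI: %s, CI: %s\n",
		p["MigGpuUuid"], p["GpuInstanceId"], p["ComputeInstanceId"])
	// prints: GPU UUID: 20bed2f5-819b-69ad-8d42-d3e1446080c1, GI: 3, CI: 0
}
```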
starry91 commented 1 year ago

Thanks @XuehaiPan, but my question was more about whether there is a way to get the MIG UUID from (GPU UUID, GI ID), i.e., 90443af0-2fe6-57fb-86fe-54186d5a6581 from (GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1, 3)? (I am using R470+.) FWIW, there is a GetMigDeviceHandleByIndex API, but that takes the GI index, not the GI ID.

XuehaiPan commented 1 year ago

but my question was more about whether there is a way to get the MIG UUID from (GPU UUID, GI ID), i.e., 90443af0-2fe6-57fb-86fe-54186d5a6581 from (GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1, 3)?

You cannot identify a MIG device without the CI ID, because a GPU instance can contain multiple compute instances, each of which is a different MIG device.

You can use:

migDevice, _ := nvml.DeviceGetHandleByUUID("MIG-GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1/3/0")
migUUID, _ := migDevice.GetUUID()  // -> MIG-90443af0-2fe6-57fb-86fe-54186d5a6581
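Composing that extended identifier from its parts is plain string formatting. A minimal sketch (migDeviceRef is a hypothetical helper name, not a go-nvml function; the GPU UUID string already carries its GPU- prefix as printed by nvidia-smi -L):

```go
package main

import "fmt"

// migDeviceRef builds the extended MIG device identifier
// MIG-GPU-<GPU UUID>/<GI ID>/<CI ID> from its three components.
func migDeviceRef(gpuUUID string, giID, ciID int) string {
	return fmt.Sprintf("MIG-%s/%d/%d", gpuUUID, giID, ciID)
}

func main() {
	ref := migDeviceRef("GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1", 3, 0)
	fmt.Println(ref) // prints: MIG-GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1/3/0
}
```

The resulting string is what the DeviceGetHandleByUUID call above accepts on drivers that support MIG.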
$ nvidia-smi
Thu Feb  2 15:30:15 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:1D:00.0 Off |                   On |
| N/A   30C    P0    31W / 250W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    2   0   0  |     19MiB / 19968MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    3   0   1  |     13MiB /  9856MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   3  |      6MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-c96ffde7-75ee-2f7e-255b-d34a594c752b)
  MIG 3g.20gb     Device  0: (UUID: MIG-418b3ff6-54a5-5bf6-915f-f5e43a266bc0)
  MIG 2g.10gb     Device  1: (UUID: MIG-53bd3992-6571-5d69-b5b3-0d9561284cc0)
  MIG 1g.5gb      Device  2: (UUID: MIG-0557e0ba-657b-5162-84eb-7c309eee8899)
  MIG 1g.5gb      Device  3: (UUID: MIG-4bb5a7a2-316f-5941-bc2b-9bea41260fb5)
In [1]: from nvitop import Device

In [2]: mig = Device("MIG-GPU-c96ffde7-75ee-2f7e-255b-d34a594c752b/2/0")

In [3]: mig
Out[3]: MigDevice(index=(0, 0), name="NVIDIA A100-PCIE-40GB MIG 3g.20gb", total_memory=19968MiB)

In [4]: mig.uuid()
Out[4]: 'MIG-418b3ff6-54a5-5bf6-915f-f5e43a266bc0'
klueska commented 1 year ago

While it should theoretically be possible to get a MIG device handle from the 3-tuple of (GPU device handle, GI, CI), there is unfortunately no direct NVML API for that. I've had an internal RFE open for this API for a while now, but no movement yet.

starry91 commented 1 year ago

Thanks @XuehaiPan! That should help.

@klueska @XuehaiPan Is there a way to programmatically get all the (GI, CI) pairs for a GPU device just by knowing its GPU index? I tried looking for a way to iterate over the GPUs and then get the GPU instances for each, but I could not find any relevant API.

klueska commented 1 year ago

Here is an example of walking all of the GIs currently created on a device: https://github.com/NVIDIA/mig-parted/blob/main/internal/nvlib/mig/mig.go#L67

And likewise all of the CIs currently created in a GI: https://github.com/NVIDIA/mig-parted/blob/main/internal/nvlib/mig/mig.go#L95

And here is an example of their usage together: https://github.com/NVIDIA/mig-parted/blob/main/pkg/mig/config/config.go#L80

starry91 commented 1 year ago

Here is an example of walking all of the GIs currently created on a device: https://github.com/NVIDIA/mig-parted/blob/main/internal/nvlib/mig/mig.go#L67

@klueska Does GetGpuInstances require some elevated permissions? I seem to get ERROR_NO_PERMISSION when trying it out. Is that expected? Following are the output and the code I was running.

Output:

INFO[0000] found 2 devices on host
ERRO[0000] error getting GPU instances for profile '0': 4
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
ERRO[0000] error getting GPU instances for profile '1': 4
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
ERRO[0000] error getting GPU instances for profile '2': 4
INFO[0000] MIG UUID -
INFO[0000] MIG UUID -
ERRO[0000] error getting GPU instances for profile '3': 4
INFO[0000] MIG UUID -
ERRO[0000] error getting GPU instances for profile '4': 4
INFO[0000] MIG UUID -
ERRO[0000] error getting GPU instances for profile '7': 4
INFO[0000] MIG UUID -

Code:

deviceCount, err := dcgm.GetAllDeviceCount()
if err != nil {
    logrus.Fatal(err)
}
logrus.Infof("found %v devices on host", deviceCount)
for gpuIndex := 0; gpuIndex < int(deviceCount); gpuIndex++ {
    device, err := nvml.DeviceGetHandleByIndex(gpuIndex)
    if err != nvml.SUCCESS {
        logrus.Fatal(err)
    }

    gpuUuid, err := device.GetUUID()
    if err != nvml.SUCCESS {
        logrus.Fatal(err)
    }
    _ = gpuUuid // retrieved here but not used further in this trimmed example
    for i := 0; i < nvml.GPU_INSTANCE_PROFILE_COUNT; i++ {
        giProfileInfo, ret := device.GetGpuInstanceProfileInfo(i)
        if ret == nvml.ERROR_NOT_SUPPORTED {
            continue
        }
        if ret == nvml.ERROR_INVALID_ARGUMENT {
            continue
        }
        if ret != nvml.SUCCESS {
            logrus.Errorf("error getting GPU instance profile info for '%v': %v", i, ret)
        }

        gis, ret := device.GetGpuInstances(&giProfileInfo)
        if ret != nvml.SUCCESS {
            logrus.Errorf("error getting GPU instances for profile '%v': %v", i, ret)
        }

        for _, gi := range gis {
            info, _ := gi.GetInfo()
            uuid, _ := info.Device.GetUUID()
            logrus.Infof("MIG UUID - %s", uuid)
        }
    }
}
elezar commented 1 year ago

@starry91 are you running the example in a container or with non-standard permissions? I believe a user needs access to the MIG monitor capability (/dev/nvidia-caps/nvidia-cap2) to list the MIG devices available. (@klueska please correct me if I'm wrong.)

klueska commented 1 year ago

Yes, it needs to be privileged: https://gitlab.com/nvidia/headers/cuda-individual/nvml_dev/-/blob/main/nvml.h#L8587

starry91 commented 1 year ago

@klueska @elezar I am running it outside containers, without any elevated permissions. Is there a way to get the information as a normal user (without elevated permissions)? I seem to get similar info from nvidia-smi, so I'm guessing there must be a way.

$ nvidia-smi
Mon Feb  6 08:46:30 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:31:00.0 Off |                   On |
| N/A   29C    P0    42W / 300W |     39MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:B1:00.0 Off |                   On |
| N/A   38C    P0    45W / 300W |     39MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    5   0   1  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    6   0   2  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    3   0   0  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    4   0   1  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    5   0   2  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$
$ nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1)
  MIG 2g.20gb     Device  0: (UUID: MIG-90443af0-2fe6-57fb-86fe-54186d5a6581)
  MIG 2g.20gb     Device  1: (UUID: MIG-0639aa14-ce99-56c6-a34a-354f5eb7d167)
  MIG 2g.20gb     Device  2: (UUID: MIG-e246d687-cb86-58e3-b88e-8222ac40255b)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-b0d3e82f-1fca-b00f-d2ea-8ee3940ebe64)
  MIG 2g.20gb     Device  0: (UUID: MIG-278a0afa-d742-502a-926a-f362a8aaa07e)
  MIG 2g.20gb     Device  1: (UUID: MIG-153c7049-3521-5d47-ae04-aa7cc45d208b)
  MIG 2g.20gb     Device  2: (UUID: MIG-30ee243d-b9bd-563c-b61b-f2e3ff3c9a13)
$
klueska commented 1 year ago

You can walk a GPU device to get all of its MIG devices and then call GetGpuInstanceId and GetComputeInstanceId against each MIG device.

Visit each MIG device: https://gitlab.com/nvidia/cloud-native/go-nvlib/-/blob/main/pkg/nvlib/device/device.go#L100

Get the GI ID given a MIG device handle: https://gitlab.com/nvidia/headers/cuda-individual/nvml_dev/-/blob/main/nvml.h#L8937

Get the CI ID given a MIG device handle: https://gitlab.com/nvidia/headers/cuda-individual/nvml_dev/-/blob/main/nvml.h#L8958
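The NVML calls themselves need a MIG-enabled GPU, but the bookkeeping that ties them together can be sketched with placeholder data. In a real program each record below would come from walking a GPU's MIG devices (GetMaxMigDeviceCount / GetMigDeviceHandleByIndex) and calling GetGpuInstanceId, GetComputeInstanceId, and GetUUID on each handle; the types and the indexByInstance helper are illustrative names, not part of go-nvml:

```go
package main

import "fmt"

// migRecord holds what would be collected per MIG device via NVML.
type migRecord struct {
	gpuUUID string
	giID    int
	ciID    int
	migUUID string
}

// key identifies a MIG device by (GPU UUID, GI ID, CI ID).
type key struct {
	gpuUUID string
	giID    int
	ciID    int
}

// indexByInstance builds the (GPU UUID, GI ID, CI ID) -> MIG UUID lookup
// that NVML does not provide directly.
func indexByInstance(records []migRecord) map[key]string {
	m := make(map[key]string, len(records))
	for _, r := range records {
		m[key{r.gpuUUID, r.giID, r.ciID}] = r.migUUID
	}
	return m
}

func main() {
	// Placeholder data taken from the nvidia-smi output earlier in this thread.
	records := []migRecord{
		{"GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1", 3, 0, "MIG-90443af0-2fe6-57fb-86fe-54186d5a6581"},
		{"GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1", 5, 0, "MIG-0639aa14-ce99-56c6-a34a-354f5eb7d167"},
	}
	idx := indexByInstance(records)
	fmt.Println(idx[key{"GPU-20bed2f5-819b-69ad-8d42-d3e1446080c1", 3, 0}])
	// prints: MIG-90443af0-2fe6-57fb-86fe-54186d5a6581
}
```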

starry91 commented 1 year ago

Thanks a lot @klueska! It is exactly what I needed.

klueska commented 1 year ago

Can this be closed?

starry91 commented 1 year ago

Thanks for the support @klueska @XuehaiPan @elezar. Closing this since the query is resolved.