NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

DCGM Diag Command on MIG Instance #94

Closed: yasirjamal87 closed this issue 1 year ago

yasirjamal87 commented 1 year ago

Does NVIDIA support running the dcgmi diag utility on a MIG instance?

yasir@517m214:~$ sudo dcgmi diag -g 0 -r 3
GPU 0's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU.
GPU 0's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU.
yasir@517m214:~$ sudo nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-a22537c7-6eaa-7bdb-0ba4-45e67e84e963)
  MIG 3g.40gb     Device  0: (UUID: MIG-1503b9a7-3336-58e3-b589-bb25ef1e39f1)
  MIG 3g.40gb     Device  1: (UUID: MIG-9dcf6483-16e6-5a0f-8db3-8a8d4969cd59)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-91d61502-6349-8428-a597-fd33747ccbee)
  MIG 3g.40gb     Device  0: (UUID: MIG-1854de69-a51b-5fbd-8654-82ab2c663383)
  MIG 3g.40gb     Device  1: (UUID: MIG-cb9fd9f0-618e-5961-be9f-6d6fae2cddbd)
yasir@517m214:~$ export CUDA_VISIBLE_DEVICES="MIG-1503b9a7-3336-58e3-b589-bb25ef1e39f1"
yasir@517m214:~$ sudo dcgmi diag -g 1 -r 3
Error: Unable to complete diagnostic for group 1. Return: (-35) The specified group is empty, and this operation is incompatible with an empty group.
yasir@517m214:~$ sudo dcgmi diag -g 0 -r 3
GPU 0's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU.
GPU 0's MIG configuration is incompatible with the diagnostic because it prevents access to the entire GPU.
yasir@517m214:~$

nikkon-dev commented 1 year ago

Hi @yasirjamal87,

Unfortunately, DCGM does not provide diagnostics support for MIG-enabled GPUs. This is because many GPU characteristics (power consumption, temperature, etc.) checked by dcgmi diag do not apply to MIG instances.

Best regards, Nik
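
Since dcgmi diag rejects GPUs that have MIG enabled, a script could check for MIG instances before invoking the diagnostic. The following is a minimal sketch, not part of DCGM: it parses text in the `nvidia-smi -L` format shown in the question and reports which GPU indices carry MIG devices. The function names (`mig_instances_by_gpu`, `gpus_with_mig`) and the regular expressions are illustrative assumptions based only on the output above; other driver versions may format the listing differently.

```python
import re

# Sample text copied from the `nvidia-smi -L` output in the question above.
SAMPLE = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-a22537c7-6eaa-7bdb-0ba4-45e67e84e963)
  MIG 3g.40gb     Device  0: (UUID: MIG-1503b9a7-3336-58e3-b589-bb25ef1e39f1)
  MIG 3g.40gb     Device  1: (UUID: MIG-9dcf6483-16e6-5a0f-8db3-8a8d4969cd59)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-91d61502-6349-8428-a597-fd33747ccbee)
  MIG 3g.40gb     Device  0: (UUID: MIG-1854de69-a51b-5fbd-8654-82ab2c663383)
  MIG 3g.40gb     Device  1: (UUID: MIG-cb9fd9f0-618e-5961-be9f-6d6fae2cddbd)
"""

# Assumed line shapes, inferred from the sample only.
GPU_LINE = re.compile(r"^GPU (\d+):")
MIG_LINE = re.compile(r"^\s+MIG\s+\S+\s+Device\s+\d+:\s+\(UUID:\s+(MIG-[0-9a-f-]+)\)")


def mig_instances_by_gpu(listing):
    """Map each GPU index to the list of MIG UUIDs listed beneath it."""
    result = {}
    current = None
    for line in listing.splitlines():
        gpu = GPU_LINE.match(line)
        if gpu:
            current = int(gpu.group(1))
            result[current] = []
            continue
        mig = MIG_LINE.match(line)
        if mig and current is not None:
            result[current].append(mig.group(1))
    return result


def gpus_with_mig(listing):
    """Return the sorted GPU indices that have at least one MIG instance."""
    return sorted(idx for idx, migs in mig_instances_by_gpu(listing).items() if migs)


if __name__ == "__main__":
    for idx in gpus_with_mig(SAMPLE):
        print(f"GPU {idx} has MIG instances; dcgmi diag is expected to reject it")
```

To run this against a live system, the sample string could be replaced with the text captured from `nvidia-smi -L`, for example via `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout`.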

yasirjamal87 commented 1 year ago

Thanks.