NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

'dcgmi discovery -c' not returning MIG instances for H100 #91

Closed: Sipondo closed this issue 1 year ago

Sipondo commented 1 year ago

Hey!

We have a DGX machine with 4x A100 cards on which MIG mode and dcgmi work great. We have now added a machine with an H100, but sadly dcgmi discovery -c does not return an instance hierarchy on the new machine. We have removed and reinstalled everything NVIDIA-related several times to troubleshoot the issue, to no avail. The output below was captured with one GPU instance and one compute instance present.

Software: dcgmi 3.1.8, nvidia-smi 535.54.03, CUDA 12.2, Ubuntu 22.04.2 LTS

nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:17:00.0 Off |                   On |
| N/A   29C    P0              56W / 350W |     12MiB / 81559MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    9   0   0  |              12MiB /  9984MiB  | 14      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

dcgmi discovery -c:

+-------------------+--------------------------------------------------------------------+
| Instance Hierarchy                                                                     |
+===================+====================================================================+
+-------------------+--------------------------------------------------------------------+

nikkon-dev commented 1 year ago

@Sipondo,

Could you please collect and share the nv-hostengine debug logs?

sudo nv-hostengine -f host.debug.log --log-level debug

WBR, Nik

Sipondo commented 1 year ago

@nikkon-dev restarting the hostengine fixed my issue! Thank you for your support!
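
For anyone hitting the same symptom, here is a minimal sketch of how the empty hierarchy could be detected in a script before deciding to restart the host engine. It only parses text shaped like the dcgmi discovery -c output quoted above. The function name hierarchy_rows and the sample constant are hypothetical helpers, not part of DCGM, and the restart hint assumes the host engine runs as the nvidia-dcgm systemd service, which may not match your setup.

```python
def hierarchy_rows(discovery_output):
    """Return the data rows of a `dcgmi discovery -c` table (empty list if none).

    Assumption: border lines start with '+', the title row contains
    'Instance Hierarchy', and every other line starting with '|' is an entity row.
    """
    rows = []
    for line in discovery_output.splitlines():
        line = line.strip()
        if line.startswith("|") and "Instance Hierarchy" not in line:
            rows.append(line)
    return rows


# The empty table reported in this issue, copied from the post above.
EMPTY_OUTPUT = """\
+-------------------+--------------------------------------------------------------------+
| Instance Hierarchy                                                                     |
+===================+====================================================================+
+-------------------+--------------------------------------------------------------------+
"""

if not hierarchy_rows(EMPTY_OUTPUT):
    # Assumption: the host engine is managed by systemd as `nvidia-dcgm`.
    print("No MIG entities reported; try restarting nv-hostengine "
          "(e.g. sudo systemctl restart nvidia-dcgm) and rerun dcgmi discovery -c")
```

In practice the text would come from running dcgmi discovery -c (for example through subprocess.run) rather than from a constant. If nvidia-smi lists MIG devices while this check still finds no rows after a restart, the debug logs requested above are the next thing to collect.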