NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

PROF_PCIE_[T|R]X_BYTES is N/A in A100 mig #88

Closed: luckqk closed this issue 1 year ago

luckqk commented 1 year ago

Is PROF_PCIE_[T|R]X_BYTES supported in A100 MIG mode? I ran this command:

    dcgmi dmon -e 1009,1010 -g 16

Both fields were reported as N/A.

I can fetch PROF_GR_ENGINE_ACTIVE, PROF_SM_ACTIVE, and other profiling fields without problems.
How can I debug this, or can you give me some advice?

Environment:
GPU: A100
dcgm-exporter version: 3.1.8-3.1.5
nv-hostengine: 3.1.8

nikkon-dev commented 1 year ago

@luckqk,

The PCIe TX/RX metrics do not have any physical meaning for MIG, so we return N/A for MIG GPU instances and compute instances. Only the whole GPU reports these values.
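
As an illustration of the point above, here is a minimal Python sketch of how a script might request the PCIe fields for the whole GPU rather than for a MIG group, and how it might treat an N/A reading as "no value". The helper names `build_dmon_command` and `parse_value` are hypothetical, and GPU ID 0 is an assumption; substitute the ID of your parent GPU. The sketch only builds the command line and does not assume any particular layout of the dmon output.

```python
from typing import List, Optional

# Field IDs taken from the command in this issue (-e 1009,1010).
PROF_PCIE_TX_BYTES = 1009
PROF_PCIE_RX_BYTES = 1010


def build_dmon_command(gpu_id: int, field_ids: List[int], samples: int = 1) -> List[str]:
    """Build a `dcgmi dmon` command that targets a whole GPU by ID.

    Hypothetical helper: uses -i (GPU ID), -e (field IDs) and -c (sample count).
    """
    fields = ",".join(str(f) for f in field_ids)
    return ["dcgmi", "dmon", "-i", str(gpu_id), "-e", fields, "-c", str(samples)]


def parse_value(token: str) -> Optional[float]:
    """Convert one value token to a float, or None when it is N/A."""
    token = token.strip()
    if token.upper() == "N/A":
        return None  # e.g. a PCIe field read on a MIG instance
    return float(token)


if __name__ == "__main__":
    # GPU ID 0 is an assumption for illustration only.
    cmd = build_dmon_command(0, [PROF_PCIE_TX_BYTES, PROF_PCIE_RX_BYTES])
    print(" ".join(cmd))
```

Treating N/A as `None` instead of 0 keeps a dashboard or exporter from reporting zero PCIe traffic for MIG instances where the metric simply does not apply.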

luckqk commented 1 year ago

Thanks for your reply.