TACC / tacc_stats

TACC Stats is an automated resource-usage monitoring and analysis package.
GNU Lesser General Public License v2.1

Integrate NVIDIA libraries for accelerators #22

Open stephenlienharrell opened 1 year ago

stephenlienharrell commented 1 year ago

https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html

^^^ library
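For reference, a minimal sketch of what polling a few DCGM fields through the Python bindings bundled with DCGM might look like. Module layout and exact signatures vary across DCGM releases, so this is an assumption about the API, not what the dcgm_support branch actually does:

```python
# Sketch only: poll a few DCGM fields via the bundled Python bindings.
# DcgmReader.py and dcgm_fields.py must be on sys.path (typically under
# the DCGM install's bindings directory); signatures vary by release.
import time

import dcgm_fields
from DcgmReader import DcgmReader

FIELDS = [
    dcgm_fields.DCGM_FI_DEV_PCIE_TX_THROUGHPUT,   # PCIe TX counter (KB)
    dcgm_fields.DCGM_FI_DEV_PCIE_RX_THROUGHPUT,   # PCIe RX counter (KB)
    dcgm_fields.DCGM_FI_DEV_PCIE_REPLAY_COUNTER,  # PCIe replays (retries)
    dcgm_fields.DCGM_FI_PROF_PCIE_TX_BYTES,       # PCIe TX rate (B/s)
    dcgm_fields.DCGM_FI_PROF_PCIE_RX_BYTES,       # PCIe RX rate (B/s)
    dcgm_fields.DCGM_FI_PROF_PIPE_TENSOR_ACTIVE,  # tensor-pipe active ratio
]

# updateFrequency is in microseconds; 1 s is a plausible stats interval.
reader = DcgmReader(fieldIds=FIELDS, updateFrequency=1000000)

for _ in range(10):
    # Returns {gpuId: {fieldId: latest value}} for the watched fields.
    sample = reader.GetLatestGpuValuesAsFieldIdDict()
    for gpu_id, values in sample.items():
        print(gpu_id, values)
    time.sleep(1)
```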

stephenlienharrell commented 1 year ago

Branch for this issue: https://github.com/TACC/tacc_stats/tree/dcgm_support

stephenlienharrell commented 1 year ago

Using this document to see what metrics Cazes wants: https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv

(From Cazes:) From the PCI section, I'd like to keep track of bytes moved over the PCI bus. We probably need to talk to NVIDIA on this one because I don't know how retries factor in. If they don't, then just keep track of bytes transmitted/received. It's also not clear which direction is transmit/receive.

PCIE

DCGM_FI_DEV_PCIE_TX_THROUGHPUT counter Total number of bytes transmitted through PCIe TX (in KB) via NVML.

DCGM_FI_DEV_PCIE_RX_THROUGHPUT counter Total number of bytes received through PCIe RX (in KB) via NVML.

DCGM_FI_DEV_PCIE_REPLAY_COUNTER counter Total number of PCIe retries.
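Since those three are cumulative counters (per the descriptions above), a collector has to difference successive samples to report a rate. A minimal sketch of that arithmetic; the sample tuples are hypothetical, not part of any existing tacc_stats interface:

```python
def pcie_counter_rates(prev, curr, dt):
    """Turn two successive samples of the cumulative PCIe counters
    (DCGM_FI_DEV_PCIE_TX_THROUGHPUT / _RX_THROUGHPUT, in KB) into
    average KB/s over the interval. `prev` and `curr` are hypothetical
    (tx_kb, rx_kb) tuples; assumes no counter rollover between samples."""
    tx_rate = (curr[0] - prev[0]) / dt
    rx_rate = (curr[1] - prev[1]) / dt
    return tx_rate, rx_rate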

In a similar vein, I’d like to see what bandwidth we’re getting across the PCIe bus:

DCGM_FI_PROF_PCIE_TX_BYTES gauge The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.

DCGM_FI_PROF_PCIE_RX_BYTES gauge The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
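To turn those gauges into a bandwidth-utilization figure, divide by the link's per-direction peak. A sketch, assuming a PCIe Gen4 x16 link (roughly 31.5 GB/s per direction); the constant would have to match the actual hardware:

```python
PCIE_GEN4_X16_PEAK_BPS = 31.5e9  # assumed per-direction peak, Gen4 x16

def pcie_utilization(tx_bps, rx_bps, peak_bps=PCIE_GEN4_X16_PEAK_BPS):
    """Fraction of the per-direction PCIe peak in use, computed from
    DCGM_FI_PROF_PCIE_TX_BYTES / _RX_BYTES gauge samples (bytes/s)."""
    return tx_bps / peak_bps, rx_bps / peak_bps
```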

And finally, are the tensor cores being used:

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE gauge Ratio of cycles the tensor (HMMA) pipe is active (in %).

I don’t see a metric to measure memory bandwidth from the HBM.

These values should tell us how well the GPU is being used and whether or not the tensor cores are being used.  I don't expect to see them used unless it's a PyTorch or TF job.

We should also be able to tell if the GPU is spending more time moving data than calculating.
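One way to make that last check concrete using only the fields above; the peak constant is again an assumption, and tensor activity is a weak compute proxy for jobs that never touch the tensor cores:

```python
def mostly_moving_data(tx_bps, rx_bps, tensor_active, peak_bps=31.5e9):
    """Crude 'moving vs. calculating' flag: True when PCIe utilization
    (fraction of an assumed Gen4 x16 per-direction peak) exceeds the
    tensor-pipe active ratio from DCGM_FI_PROF_PIPE_TENSOR_ACTIVE.
    Misses CUDA-core-only compute, which would need an SM-activity
    gauge to account for."""
    pcie_frac = max(tx_bps, rx_bps) / peak_bps
    return pcie_frac > tensor_active
```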