Open stephenlienharrell opened 1 year ago
Branch for this issue: https://github.com/TACC/tacc_stats/tree/dcgm_support
Using this document to see what metrics Cazes wants: https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv
(From Cazes:) From the PCI section, I’d like to keep track of bytes moved over the PCI bus. We probably need to talk to NVIDIA on this one because I don’t know how retries factor in. If they don’t, then just keep track of bytes transmitted/received. It’s also not clear which direction is transmit/receive.
In a similar vein, I’d like to see what bandwidth we’re getting across the PCIe bus.
And finally, are the tensor cores being used?
I don’t see a metric to measure memory bandwidth from the HBM.
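For reference, these look like the relevant profiling fields in the dcp-metrics-included.csv linked above (field names and descriptions as they appear in that CSV; support varies by GPU generation, and DCGM_FI_PROF_DRAM_ACTIVE is the closest thing I see to an HBM memory-bandwidth metric):

```
# PCIe traffic (counters, bytes)
DCGM_FI_PROF_PCIE_TX_BYTES, counter, The number of bytes of active PCIe tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES, counter, The number of bytes of active PCIe rx data including both header and payload.
# Tensor core utilization (ratio of cycles the tensor pipe is active)
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
# Device memory (HBM) interface activity
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data.
```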
These values should tell us how well the GPU is being used and whether or not the tensor cores are being used. I don’t expect to see them used unless it’s a PyTorch or TF job.
We should also be able to tell if the GPU is spending more time moving data than calculating.
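The moving-data-vs-calculating check could be derived with simple arithmetic once the counters are sampled. A minimal sketch (hypothetical helper names, not tacc_stats code; assumes the collector already produces per-interval counter deltas and activity ratios):

```python
def pcie_bandwidth_bytes_per_sec(tx_bytes_delta, rx_bytes_delta, interval_sec):
    """Average PCIe bandwidth over one sampling interval,
    from the per-interval deltas of the TX/RX byte counters."""
    return (tx_bytes_delta + rx_bytes_delta) / interval_sec

def movement_bound(sm_active, dram_active):
    """Crude heuristic: the GPU is spending more time moving data than
    calculating when memory-interface activity exceeds SM activity.
    Both arguments are ratios in [0, 1]."""
    return dram_active > sm_active

# Example: 8 GiB moved over a 10 s interval -> ~0.8 GiB/s
bw = pcie_bandwidth_bytes_per_sec(6 * 2**30, 2 * 2**30, 10.0)
print(bw / 2**30)                                   # GiB/s
print(movement_bound(sm_active=0.35, dram_active=0.70))
```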
https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
^^^ library documentation