NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Support for reporting FP8 and Transformer Engine usage on H100 GPUs #86

Open hassanbabaie opened 1 year ago

hassanbabaie commented 1 year ago

I'm wondering what the plan is for being able to break out and report on FP8 and Transformer Engine usage on H100s via DCGM (so that we then get it via DCGM Exporter).

DCGM supports FP64, FP32, and FP16, but it seems like we're missing an update that would let us break out or detect usage of some of the new features.

I double-checked here and don't see an obvious field that I could look at to detect this type of usage:

dcgmlib/dcgm_fields.h

Any thoughts on this would be appreciated

Thanks

nikkon-dev commented 1 year ago

Hello @hassanbabaie,

Unfortunately, it is currently not possible to break down pipelines in order to isolate FP8 utilization.

rnertney commented 11 months ago

A good recommendation is to review these

IMMA covers the INT8/FP8 tensor instructions; HMMA covers the FP16/FP32 tensor instructions.

This would give you some sort of correlation with tensor core usage; I recommend applying some heuristics to see whether the metrics correlate as one might expect.
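As a rough illustration of the kind of heuristic suggested above, the sketch below computes a Pearson correlation between a flag marking when a known workload was running and the sampled activity of each tensor pipe. It is only a sketch under assumptions: the metric names in the comments (`DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE`, `DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE`) are assumed to be the relevant profiling fields, the `pearson` helper is hypothetical, and every sample value is made up for illustration rather than measured.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (hypothetical helper).

    Returns 0.0 when either series is constant, since correlation is
    undefined there and a flat metric carries no signal for this check.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Made-up samples: 1 while the test workload was running, 0 otherwise.
job_active = [0, 0, 1, 1, 1, 0]
# Made-up samples standing in for DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE (assumed name).
imma_active = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# Made-up samples standing in for DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE (assumed name).
hmma_active = [0.01, 0.02, 0.60, 0.70, 0.65, 0.02]

for name, series in (("IMMA", imma_active), ("HMMA", hmma_active)):
    print(f"{name} vs. job activity: r = {pearson(job_active, series):.3f}")
```

A metric that stays flat while the workload runs yields no correlation, which suggests that pipe is not where the workload's instructions are being counted.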

hassanbabaie commented 10 months ago

Hi @rnertney, just a quick heads-up: I'm not sure we're seeing this. We had an FP8 run and did not see it trigger the IMMA metric.


hassanbabaie commented 8 months ago

@rnertney any luck on the above ^^ thanks again