NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

sum of DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE is not const #271

Open ccding opened 8 months ago

ccding commented 8 months ago

Our use case needs to show GPU memory usage relative to total memory, so we used the sum of the two metrics above as the total GPU memory. However, the sum does not appear to be constant.

Here is the output:

image

glowkey commented 8 months ago

Depending on your driver version, you may also need to include DCGM_FI_DEV_FB_RESERVED in your equation.
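The suggestion above can be sketched as a small calculation. This is only an illustration, assuming the three framebuffer metrics are reported in the same unit (MiB here) and that reserved memory is not already counted in the used or free values on the driver in question. The function names are hypothetical and not part of dcgm-exporter.

```python
def fb_total_mib(used: float, free: float, reserved: float = 0.0) -> float:
    """Estimate total framebuffer memory as used + free + reserved.

    Arguments are sample values of DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE
    and DCGM_FI_DEV_FB_RESERVED, assumed to share one unit (MiB).
    If the reserved metric is not exported, it defaults to 0.
    """
    return used + free + reserved


def fb_used_fraction(used: float, free: float, reserved: float = 0.0) -> float:
    """Return used memory as a fraction of the estimated total."""
    total = fb_total_mib(used, free, reserved)
    if total <= 0:
        raise ValueError("total framebuffer memory must be positive")
    return used / total
```

If the reserved metric is missing from Prometheus, as reported later in this thread, the sum falls back to used + free and may still fluctuate.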

ccding commented 8 months ago

Thanks for the response. My driver version is 535.129.03, and I don't see DCGM_FI_DEV_FB_RESERVED in my Prometheus.

The output of nvidia-smi shows an accurate and constant total GPU memory.

These are the only available metrics:

image
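As a possible workaround for comparison, the constant total reported by nvidia-smi can be read programmatically. The sketch below assumes nvidia-smi is on the PATH and uses its `--query-gpu=memory.total --format=csv,noheader,nounits` options, which print one total (in MiB) per GPU. The helper names are hypothetical.

```python
import subprocess
from typing import List

# Command that prints one line per GPU containing total memory in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"]


def parse_total_mib(output: str) -> List[int]:
    """Parse one integer per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def query_total_mib() -> List[int]:
    """Run nvidia-smi and return total memory in MiB for each GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_total_mib(result.stdout)
```

Calling `query_total_mib()` on a GPU host returns a list with one entry per GPU, which can be compared against the sum of the exported DCGM metrics.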

ccding commented 8 months ago

@nvvfedorov is this fixed?

zclyne commented 1 month ago

I have exactly the same use case and ran into the same issue. Why is there no metric showing the total amount of GPU memory (like 32GB for V100, 80GB for H100, etc.)?

nvvfedorov commented 1 month ago

I reopened the issue, as it is still active and of interest to the community.

danilkaz commented 2 weeks ago

I have the same problem.

I need to compute some statistics about GPU usage, but I get wrong results.