NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Memory usage by dcgm during runtime diagnostics #163

Open BetaZYN opened 2 months ago

BetaZYN commented 2 months ago

When running extended-level diagnostics on 8 GPUs simultaneously, 8 H20s use at most about 8 GB of memory, while 8 H800s use up to 16 GB. What causes such a significant difference in memory usage?

nikkon-dev commented 2 months ago

@BetaZYN,

The amount of memory allocated depends on the VRAM available on the GPUs. Several of the memory tests need to allocate large buffers in memory.

BetaZYN commented 2 months ago

@nikkon-dev The H20 has 96 GB of VRAM, though, while the H800 has only 80 GB. Even so, during diagnostic runs DCGM's memory consumption on the H800 is twice that on the H20.
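
To make the two numbers directly comparable, it may help to record peak memory the same way on both machines. The following is a minimal sketch, not part of DCGM, and it assumes the figures above refer to host (system) memory on Linux. It sums the resident set size (`VmRSS` from `/proc/<pid>/status`) of processes matched by name and tracks the peak while a diagnostic runs. The process names `nv-hostengine` and `nvvs` are assumptions about which processes hold the memory; adjust them to whatever `ps` shows on your system.

```python
import os
import re
import time


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or 0 if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def rss_by_name(names):
    """Sum the current RSS (KiB) of every process whose comm is in `names`.

    Note: the kernel truncates comm to 15 characters.
    """
    total = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            if comm in names:
                with open(f"/proc/{pid}/status") as f:
                    total += parse_vm_rss_kib(f.read())
        except OSError:
            continue  # the process exited between listing and reading
    return total


def sample_peak_kib(names, seconds, interval=1.0):
    """Poll rss_by_name() for `seconds` and return the highest total seen (KiB)."""
    peak = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        peak = max(peak, rss_by_name(names))
        time.sleep(interval)
    return peak


# Example (assumed process names, run while the diagnostic is in progress):
# peak = sample_peak_kib({"nv-hostengine", "nvvs"}, seconds=600)
# print(f"peak RSS: {peak / 1024 / 1024:.2f} GiB")
```

Running this alongside the same diagnostic level on the H20 and H800 hosts, and dividing the peak by the GPU count, would show whether the roughly 1 GB versus 2 GB per GPU difference is reproducible.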