NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Bundled CUDA libraries #89

Open zzzoom opened 1 year ago

zzzoom commented 1 year ago

Installing DCGM on a stateless node is untenable at the moment: the dcgm package is a 1.5 GB behemoth, of which ~1 GB is different versions of cuBLAS and ~240 MB is different versions of cuRAND. Appropriate versions of both libraries are usually already present elsewhere on the system, as they are pretty much essential for CUDA applications.

Please consider packaging those libraries separately and letting us point DCGM at our own copies of them.
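
For anyone who wants to reproduce the size breakdown above on their own install, here is a rough sketch that walks a directory tree and sums file sizes by library-name prefix. The function name `tally_library_sizes` and the default prefixes are illustrative assumptions, and the install root is whatever directory your package unpacks to (it is not hard-coded because it may differ between distributions). Note that prefix matching means files such as `libcublasLt` are counted under `libcublas`.

```python
import os
import sys
from collections import defaultdict


def tally_library_sizes(root, prefixes=("libcublas", "libcurand")):
    """Sum on-disk sizes (bytes) of files under root, grouped by filename prefix.

    Hypothetical helper for illustration; the prefixes are assumptions about
    how the bundled libraries are named.
    """
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # Skip symlinks so versioned .so aliases are not double-counted.
                continue
            size = os.path.getsize(path)
            totals["total"] += size
            for prefix in prefixes:
                if name.startswith(prefix):
                    totals[prefix] += size
                    break
    return dict(totals)


if __name__ == "__main__":
    # Usage: python tally.py <install-root>
    for key, size in sorted(tally_library_sizes(sys.argv[1]).items()):
        print(f"{key}: {size / 1e6:.1f} MB")
```

Running it against the package's install root should show how much of the total is attributable to each library family; the exact figures will depend on the DCGM version installed.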