NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Including all field IDs in dcgmi dmon -e 1002,1004,.... fails with "Got error while creating a Field Group: Bad parameter passed to function" #44

Closed: hyoonseo159357 closed this issue 2 years ago

hyoonseo159357 commented 2 years ago

I want to include all field IDs in a single command, like dcgmi dmon -e 1002,1004..............

However, if I enter more than a certain number, the following error occurs:

Got error while creating a Field Group: Bad parameter passed to function

Is there a way to print all fields through dcgmi dmon? No matter how much I search, I can't find one.

nikkon-dev commented 2 years ago

@hyoonseo159357,

As of today there are several limitations related to the number of fields you can collect simultaneously via a single connection (single dcgmi command):

  1. There can be only 64 field groups, with at most 128 fields in each group. A single dcgmi command creates a single field group, so this limitation alone means one command cannot collect more than 128 fields.
  2. The communication protocol limits the size of a single message, and all watched fields arrive in a single message. This usually does not cause the error you observe; instead, you do not get all the fields for all the GPUs you wanted.
  3. Your GPU may not support some of the fields you request, and the logic is all-or-nothing: either all requested fields are available or the whole batch fails.

You could try using multiple dcgmi commands with different sets of fields, or use the DCGM API directly to overcome these limitations.
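The "multiple dcgmi commands" workaround could be sketched roughly as follows. This is only an illustration, not part of DCGM: the helper names chunk_field_ids and build_dmon_commands are hypothetical, the 128-field cap is taken from the limits listed above, and the script only builds the command lines rather than running them.

```python
from typing import Iterable, List

# Per-field-group limit quoted in the reply above (an assumption if your
# DCGM version differs).
MAX_FIELDS_PER_GROUP = 128


def chunk_field_ids(field_ids: Iterable[int],
                    max_per_group: int = MAX_FIELDS_PER_GROUP) -> List[List[int]]:
    """Split field IDs into consecutive batches of at most max_per_group.

    Duplicate IDs are dropped (first occurrence wins, order preserved).
    """
    if max_per_group < 1:
        raise ValueError("max_per_group must be at least 1")
    unique_ids = list(dict.fromkeys(field_ids))
    return [unique_ids[i:i + max_per_group]
            for i in range(0, len(unique_ids), max_per_group)]


def build_dmon_commands(field_ids: Iterable[int],
                        max_per_group: int = MAX_FIELDS_PER_GROUP) -> List[List[str]]:
    """Return one 'dcgmi dmon -e <ids>' argument list per batch (hypothetical helper)."""
    return [["dcgmi", "dmon", "-e", ",".join(str(f) for f in batch)]
            for batch in chunk_field_ids(field_ids, max_per_group)]


if __name__ == "__main__":
    # 1002 and 1004 come from the question; extend the list with the IDs you need.
    example_ids = [1002, 1004]
    for cmd in build_dmon_commands(example_ids):
        print(" ".join(cmd))
```

Each printed command would then be started as its own dcgmi process. Because a batch fails as a whole when the GPU does not support one of its fields, a smaller max_per_group can help narrow down which field is the problem.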

WBR, Nik