NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

[feature request] Add support for config file options that apply to all SKUs #101

Open dmonakhov opened 1 year ago

dmonakhov commented 1 year ago

Currently, config options are grouped under individual SKUs, as follows:

version: xx
spec: dcgm-diag-v1
skus:
  - name: GPU-name
    id: GPU part number
    memtest:
      test_duration: 10

This makes it hard to customize test options that are independent of GPU type. For example, I want to create an extra-long stress test config that makes sense on all platforms: memtest.test_duration=3600. To do that, the config has to duplicate the options for every possible GPU, which is absolutely impractical, or the parameter has to be passed via the "-p" option, which is not very useful if the list of parameters is long.

It would be super useful to support a config syntax that can be applied to any GPU. For example, the user creates a portable stress profile once:

$ cat stress.yaml
version: xx
spec: dcgm-diag-v1
skus:
  - name: "*"  # <= applies to any GPU (quoted, since a bare * starts a YAML alias)
    id: "*"    # <= applies to any GPU
    memtest:
      is_allowed: true
      test_duration: 3600

Then use the portable stress profile on any machine:

dcgmi diag -c stress.yaml -r 4
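
To make the requested semantics concrete, here is a minimal Python sketch of how a wildcard entry could be resolved against a specific GPU. It is only an illustration of the proposal, not how DCGM works today: the "*" marker, the function names, and the precedence rule (SKU-specific values override wildcard values) are all assumptions on my side. The config is shown as already-parsed Python data to keep the example free of external dependencies.

```python
WILDCARD = "*"  # hypothetical marker from the proposal above, not a DCGM feature


def specificity(entry):
    """Count how many of name/id are concrete (non-wildcard) values."""
    return sum(entry.get(key) != WILDCARD for key in ("name", "id"))


def resolve_sku_options(skus, gpu_name, gpu_id):
    """Merge all SKU entries that match this GPU into one options dict.

    Assumed precedence: wildcard entries are applied first, so values from
    more specific entries override them.
    """
    target = {"name": gpu_name, "id": gpu_id}
    matching = [
        entry for entry in skus
        if all(entry.get(key) in (WILDCARD, target[key]) for key in target)
    ]
    merged = {}
    # sorted() is stable, so entries of equal specificity keep file order.
    for entry in sorted(matching, key=specificity):
        for test_name, params in entry.items():
            if test_name in ("name", "id"):
                continue
            merged.setdefault(test_name, {}).update(params)
    return merged


# Parsed equivalent of a config holding one wildcard and one per-SKU entry.
skus = [
    {"name": "*", "id": "*",
     "memtest": {"is_allowed": True, "test_duration": 3600}},
    {"name": "GPU-name", "id": "GPU part number",
     "memtest": {"test_duration": 10}},
]

# A GPU with no dedicated entry gets the portable stress profile.
any_gpu = resolve_sku_options(skus, "some other GPU", "some other part number")
# A GPU with its own entry keeps its SKU-specific override.
known_gpu = resolve_sku_options(skus, "GPU-name", "GPU part number")
```

With this rule, a GPU without a dedicated entry would run memtest for 3600 seconds, while the SKU listed explicitly would keep its own test_duration of 10 and still inherit is_allowed from the wildcard entry. The opposite precedence (wildcard overrides everything) would also be a reasonable choice for a stress profile; I do not have a strong opinion on which one is better.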