NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

diag --configfile option is silently ignored if --parameters option is present #148

Open dmonakhov opened 5 months ago

dmonakhov commented 5 months ago

I use an aws-platform.yaml config file to pass platform characteristics that never change:

version: AWS-0.1
spec: dcgm-diag-v1
skus:
  - name: NVIDIA H100 80GB HBM3 p5.48xlarge
    id: 2330
    pcie:
      is_allowed: true
      h2d_d2h_single_pinned:
        min_pci_generation: 5.0
        min_pci_width: 16.0
        min_bandwidth: 14.0
        max_latency: 5
      h2d_d2h_single_unpinned:
        min_pci_generation: 5.0
        min_pci_width: 16.0
        min_bandwidth: 14.0
      gpu_nvlinks_expected_up: 18
      nvswitch_nvlinks_expected_up: 6

But I also want to customize other parameters, such as test_duration, and use the --parameters option for this:

dcgmi diag --verbose --json --configfile diag-aws.yaml --run long --parameters memtest.test_duration=120

However, it appears that the --configfile option is silently ignored if the --parameters option is present, and nvvs is called in configless mode:

 /usr/share/nvidia-validation-suite/nvvs -j -z --specifiedtest long --parameters memtest.test_duration=120 --configless -v --indexes 0,1,2,3,4,5,6,7 

This is very counterintuitive and makes quick parameter prototyping hard, because only one of --configfile or --parameters can be used at a time. Passing all the system parameters through --parameters does not seem practical.
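
Until this is addressed, a wrapper script can at least make the problem visible instead of silent. The following is only a minimal sketch, not part of DCGM: the function name `conflicting_diag_flags` is hypothetical, and accepting the `--flag=value` spelling is a defensive assumption rather than documented dcgmi behavior. It only inspects the command line for the two options named in this report.

```python
import shlex


def conflicting_diag_flags(command: str) -> bool:
    """Return True if the command line passes both --configfile and --parameters.

    Hypothetical helper: it does not call dcgmi, it only inspects the arguments.
    """
    tokens = shlex.split(command)

    def has_flag(name: str) -> bool:
        # Assumption: accept "--flag value" as well as "--flag=value".
        return any(t == name or t.startswith(name + "=") for t in tokens)

    return has_flag("--configfile") and has_flag("--parameters")


cmd = (
    "dcgmi diag --verbose --json --configfile diag-aws.yaml "
    "--run long --parameters memtest.test_duration=120"
)

if conflicting_diag_flags(cmd):
    print("warning: --configfile may be silently ignored because --parameters is present")
```

A wrapper could run this check before invoking the diagnostic and refuse to continue, so a run without the platform config is never mistaken for a valid one.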