NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

dcgm diagnostic command exits with status 226 #158

Open rajeshvenkata opened 6 months ago

rajeshvenkata commented 6 months ago

We are running the dcgm diagnostic command below on an EC2 instance, inside a Docker container. The command runs for some time (~30 mins) and then exits with status code 226. No other error details are reported.

command: dcgmi "diag", "--run", "4", "-p", "memtest.test0=true\;memtest.test1=true\;memtest.test2=true\;memtest.test3=true\;memtest." + "test4=true\;memtest.test5=true\;memtest.test6=true\;memtest.test7=true\;memtest.test8=true\;memtest." + "test9=true\;memtest.test10=true", "--json"

dcgm version: 3.3.5

Please let us know if you have any pointers on how to debug this issue. Thanks!

nikkon-dev commented 6 months ago

Exit status 226 corresponds to DCGM_ST_NVVS_ERROR (-30). dcgmi returns negative error codes, and the exit status is reported as an unsigned 8-bit value, so -30 shows up as 226 (256 - 30). This error means nvvs (the backend for diagnostics) hit an unexpected error. Please look for both nv-hostengine.log and nvvs.log in /var/log/nvidia
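
For anyone else decoding a dcgmi exit status, here is a minimal Python sketch of the conversion described above. It assumes the status was truncated to 8 bits, as is usual for process exit statuses on Linux. The helper name `exit_status_to_signed` is made up for illustration and is not part of DCGM.

```python
def exit_status_to_signed(status: int) -> int:
    """Reinterpret an 8-bit exit status (0..255) as a signed two's-complement value.

    Hypothetical helper for illustration; not a DCGM API.
    """
    if not 0 <= status <= 255:
        raise ValueError(f"exit status out of 8-bit range: {status}")
    # Values 128..255 represent the negative numbers -128..-1.
    return status - 256 if status >= 128 else status


# 226 is the status reported in this issue.
print(exit_status_to_signed(226))
```

The signed result can then be compared with the DCGM_ST_* value mentioned in the comment above (-30 for DCGM_ST_NVVS_ERROR).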