Open rajeshvenkata opened 6 months ago
Exit status 226 corresponds to DCGM_ST_NVVS_ERROR (-30): dcgmi returns negative error codes, and the process exit status reports them as an unsigned 8-bit value, so -30 shows up as 226 (256 - 30). This error means something unexpected went wrong in nvvs (the backend for diagnostics). Please look for both nv-hostengine.log and nvvs.log in /var/log/nvidia
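To make the mapping between 226 and -30 concrete, here is a minimal sketch of the arithmetic, assuming the usual 8-bit process exit status on Linux. The function name `exit_status_to_signed` is just illustrative and is not part of DCGM.

```python
def exit_status_to_signed(status: int) -> int:
    """Reinterpret an 8-bit process exit status (0..255) as a signed value."""
    if not 0 <= status <= 255:
        raise ValueError("exit status must be in the range 0..255")
    # Values of 128 and above are the two's-complement form of a negative code.
    return status - 256 if status >= 128 else status


print(exit_status_to_signed(226))  # -30
```

The same conversion can be applied to any other large exit status from dcgmi to recover the negative code, which can then be looked up among the DCGM_ST_* values.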
We are running the DCGM diagnostic command below on an EC2 instance through a Docker container. The command runs for some time (~30 minutes) and exits with status code 226. No other details about the error are reported.
command: dcgmi "diag", "--run", "4", "-p", "memtest.test0=true\;memtest.test1=true\;memtest.test2=true\;memtest.test3=true\;memtest." + "test4=true\;memtest.test5=true\;memtest.test6=true\;memtest.test7=true\;memtest.test8=true\;memtest." + "test9=true\;memtest.test10=true", "--json"
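For reference, a sketch of how the same argument list could be assembled programmatically. It assumes the command is launched without a shell (for example as an argument list), in which case no shell would strip the `\;` escapes, so this version joins the parameters with plain `;`. Whether that difference has anything to do with the 226 exit status is not established here. The helper names are hypothetical.

```python
def build_memtest_params(count: int = 11) -> str:
    """Build 'memtest.test0=true;...;memtest.test{count-1}=true'."""
    return ";".join(f"memtest.test{i}=true" for i in range(count))


def build_diag_argv(params: str) -> list[str]:
    """Return the argument list for the diag invocation shown above."""
    return ["dcgmi", "diag", "--run", "4", "-p", params, "--json"]


argv = build_diag_argv(build_memtest_params())
print(argv)
```

The resulting list can be passed to `subprocess.run(argv)`, and the `returncode` attribute of the result inspected afterwards.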
dcgm version: 3.3.5
Please let us know if there are any pointers on how to debug the issue. Thanks!