NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Why does /var/log/nv-hostengine.log contain many "ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches()" entries? #174

Open 13416157913 opened 1 month ago

13416157913 commented 1 month ago

Hello everyone, why my /var/log/nv-hostengine.log file had many ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() . like this : 2024-05-24 10:11:29.243 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:11:59.243 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:12:29.243 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:12:59.243 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:13:29.243 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:13:59.244 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:14:29.244 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:14:59.244 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() 
returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:15:29.244 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:15:59.244 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:16:29.244 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:16:59.244 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:17:29.244 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:17:59.244 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:18:29.245 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:18:59.245 ERROR [5231:5273] [[NvSwitch]] 
ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:19:29.245 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:19:59.245 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:20:29.245 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:20:59.245 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:21:29.245 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:21:59.245 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:22:29.245 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:22:59.245 ERROR 
[5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:23:29.246 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:23:59.246 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:24:29.246 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:24:59.246 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:25:29.246 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:25:59.246 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:26:29.246 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 
10:27:59.246 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:28:29.247 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:28:59.247 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:29:29.247 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:29:59.247 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:30:29.247 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:30:59.247 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:31:29.247 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] 
[DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:32:08.069 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:32:38.069 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:33:08.069 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:33:38.069 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce] 2024-05-24 10:34:08.069 ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state [/workspaces/dcgm-rel_dcgm_2_2-postmerge/modules/nvswitch/DcgmModuleNvSwitch.cpp:388] [DcgmNs::DcgmModuleNvSwitch::RunOnce]
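
The entries above recur roughly every 30 seconds. For anyone who wants to check the same pattern in their own log, here is a minimal sketch that counts these entries and tallies the gaps between them. It assumes the entry format is exactly as pasted above; the names `ENTRY` and `summarize` are my own and are not part of DCGM.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed entry shape, taken from the excerpt above:
#   <date> <time> ERROR [pid:tid] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() ...
# \s+ is used between tokens so wrapped or flattened text still matches.
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}\.\d{3})\s+ERROR\s+"
    r"\[\d+:\d+\]\s+\[\[NvSwitch\]\]\s+ReadNvSwitchStatusAllSwitches\(\)"
)


def summarize(text):
    """Return (number of matching entries, Counter of gaps in whole seconds)."""
    stamps = [
        datetime.strptime(f"{m.group(1)} {m.group(2)}", "%Y-%m-%d %H:%M:%S.%f")
        for m in ENTRY.finditer(text)
    ]
    # Gap between each entry and the next, rounded to the nearest second.
    gaps = Counter(
        round((later - earlier).total_seconds())
        for earlier, later in zip(stamps, stamps[1:])
    )
    return len(stamps), gaps


# Three entries copied from the log excerpt (message tail shortened here).
SAMPLE = (
    "2024-05-24 10:11:29.243 ERROR [5231:5273] [[NvSwitch]] "
    "ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state\n"
    "2024-05-24 10:11:59.243 ERROR [5231:5273] [[NvSwitch]] "
    "ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state\n"
    "2024-05-24 10:12:29.243 ERROR [5231:5273] [[NvSwitch]] "
    "ReadNvSwitchStatusAllSwitches() returned Object is in an undefined state\n"
)

count, gaps = summarize(SAMPLE)
print(count, dict(gaps))
```

To run it against the real file, pass the contents of /var/log/nv-hostengine.log to `summarize` instead of `SAMPLE`.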

My nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A800-SXM4-80GB          On  |   00000000:5B:00.0 Off |                    0 |
| N/A   31C    P0             64W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB          On  |   00000000:5E:00.0 Off |                    0 |
| N/A   30C    P0             59W /  400W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

dcgmi -v:

Version : 2.2.9
Build ID : 14
Build Date : 2021-07-23
Build Type : Release
Commit ID : 3d9c443e28d491a942d3f0bbad0cf0579a20fdfd
Branch Name : rel_dcgm_2_2
CPU Arch : x86_64
Build Platform : Linux 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64
CRC : a015d3b885ad821a2424294a80e2366e