Please provide an in-depth description of the question you have:
The vgpu-monitor on an edge node gets an empty result when it watches pods. As a result, the GPUs inside the pods on that node cannot currently be used.
I1016 08:05:59.448004 2842 cudevshr.go:167] Adding ctr dirname /usr/local/vgpu/containers/ebfbb069-a68a-4ea1-b5e0-31620323d43b_27637095363252224 in monitorpath
I1016 08:05:59.448046 2842 feedback.go:255] utSwitchon=map[GPU-bd17afd2-154d-8fe5-7be5-6eb460a8205c:[0 1]]
I1016 08:05:59.448087 2842 feedback.go:256] Setting UtilizationSwitch to off ebfbb069-a68a-4ea1-b5e0-31620323d43b_27637095363252224
E1018 09:19:26.388200 2842 reflector.go:147] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Failed to watch *v1.Pod: an error on the server ("") has prevented the request from succeeding (get pods)
E1018 09:19:26.388317 2842 request.go:1116] Unexpected error when reading response body: unexpected EOF
E1018 09:19:26.389648 2842 feedback.go:269] Failed to update container list: unexpected error when reading response body. Please retry. Original error: unexpected EOF
E1018 18:33:43.467487 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1018 18:33:48.471971 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1018 18:33:53.475766 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:05.416382 2842 feedback.go:269] Failed to update container list: an error on the server ("") has prevented the request from succeeding (get pods)
E1019 09:02:10.422895 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:15.427714 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:20.431912 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:25.435888 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:30.439803 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:35.443847 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:40.447807 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:45.451884 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:50.455948 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:55.460905 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:00.464881 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:05.467922 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:10.470424 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:15.472801 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:20.475873 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:25.479862 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:30.484702 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:35.486580 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:40.491808 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1020 01:57:13.623823 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1020 01:57:18.627811 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1020 01:57:23.633409 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1020 01:57:28.638264 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1020 01:57:33.639872 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1023 06:37:32.725273 2842 request.go:1116] Unexpected error when reading response body: unexpected EOF
E1023 06:37:32.727300 2842 feedback.go:269] Failed to update container list: unexpected error when reading response body. Please retry. Original error: unexpected EOF
E1023 06:38:54.781895 2842 feedback.go:269] Failed to update container list: an error on the server ("") has prevented the request from succeeding (get pods)
E1024 20:34:08.045679 2842 feedback.go:269] Failed to update container list: an error on the server ("") has prevented the request from succeeding (get pods)
E1024 20:34:13.051113 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1024 20:34:18.055952 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1024 20:34:23.061614 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1024 20:34:28.063871 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1024 20:34:33.069520 2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
What do you think about this question?:
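Several nodes currently show this problem, and there is no in-cluster network connectivity between the edge nodes and the master. From what we have observed so far, the plugin always runs stably for a while before the problem appears, and restarting the device-plugin on the affected node restores normal operation.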
Environment: