Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0
957 stars 197 forks

Running vgpu-monitor reports errors when fetching the pod list #576

Closed zhangchi6414 closed 3 weeks ago

zhangchi6414 commented 3 weeks ago

Please provide an in-depth description of the question you have: The vgpu-monitor on an edge node gets an empty result when watching pods. This problem currently makes the GPUs inside the pods on that node unusable.

I1016 08:05:59.448004    2842 cudevshr.go:167] Adding ctr dirname /usr/local/vgpu/containers/ebfbb069-a68a-4ea1-b5e0-31620323d43b_27637095363252224 in monitorpath
I1016 08:05:59.448046    2842 feedback.go:255] utSwitchon=map[GPU-bd17afd2-154d-8fe5-7be5-6eb460a8205c:[0 1]]
I1016 08:05:59.448087    2842 feedback.go:256] Setting UtilizationSwitch to off ebfbb069-a68a-4ea1-b5e0-31620323d43b_27637095363252224
E1018 09:19:26.388200    2842 reflector.go:147] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: Failed to watch *v1.Pod: an error on the server ("") has prevented the request from succeeding (get pods)
E1018 09:19:26.388317    2842 request.go:1116] Unexpected error when reading response body: unexpected EOF
E1018 09:19:26.389648    2842 feedback.go:269] Failed to update container list: unexpected error when reading response body. Please retry. Original error: unexpected EOF
E1018 18:33:43.467487    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1018 18:33:48.471971    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1018 18:33:53.475766    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:05.416382    2842 feedback.go:269] Failed to update container list: an error on the server ("") has prevented the request from succeeding (get pods)
E1019 09:02:10.422895    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:15.427714    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:20.431912    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:25.435888    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:30.439803    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:35.443847    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:40.447807    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:45.451884    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:50.455948    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:02:55.460905    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:00.464881    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:05.467922    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:10.470424    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:15.472801    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:20.475873    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:25.479862    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:30.484702    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:35.486580    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1019 09:03:40.491808    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1020 01:57:13.623823    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1020 01:57:18.627811    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1020 01:57:23.633409    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1020 01:57:28.638264    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1020 01:57:33.639872    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1023 06:37:32.725273    2842 request.go:1116] Unexpected error when reading response body: unexpected EOF
E1023 06:37:32.727300    2842 feedback.go:269] Failed to update container list: unexpected error when reading response body. Please retry. Original error: unexpected EOF
E1023 06:38:54.781895    2842 feedback.go:269] Failed to update container list: an error on the server ("") has prevented the request from succeeding (get pods)
E1024 20:34:08.045679    2842 feedback.go:269] Failed to update container list: an error on the server ("") has prevented the request from succeeding (get pods)
E1024 20:34:13.051113    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1024 20:34:18.055952    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1024 20:34:23.061614    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1024 20:34:28.063871    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods
E1024 20:34:33.069520    2842 feedback.go:269] Failed to update container list: can not cache for vgpumonitor list pods: /api/v1/pods

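What do you think about this question?: Several nodes currently show this problem, and there is no in-cluster network connectivity between the edge nodes and the master. From what we have observed so far, the plugin always runs stably for a while before it fails, and restarting the device-plugin on the node restores normal operation.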

Environment:

archlitchi commented 3 weeks ago

Yes, this looks like a network problem. Judging from the errors, it is caused by the edge side being unable to reach the apiserver.