Open roehrich-hpe opened 4 months ago
We have a subset of nnf-node-manager pods in CrashLoopBackOff. All show the same messages in their logs. This is nnf-deploy v0.0.8.

All controllers in nnf-node-manager are failing. The NnfNode controller's Start() method reports a "get namespace failed" error before it can create the namespace, and all of the other controllers in the manager show the same "Timeout: failed waiting for ... Informer to sync" error (NnfNodeStorage is one example below); the backtraces on all of the controllers are identical. The final "problem running manager" message at the end of the log is from the mgr.Start() call in main.

A complete log from one of the pods:
2024-02-14T08:15:01.773-0800 INFO setup GOMAXPROCS {"value": 128}
2024-02-14T08:15:01.774-0800 INFO setup starting manager
2024-02-14T08:15:01.774-0800 INFO controller-runtime.metrics Starting metrics server
2024-02-14T08:15:01.774-0800 INFO starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-02-14T08:15:01.774-0800 INFO controller-runtime.metrics Serving metrics server {"bindAddress": ":8080", "secure": false}
2024-02-14T08:15:01.775-0800 INFO Starting EventSource {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "source": "kind source: *v1alpha2.ClientMount"}
2024-02-14T08:15:01.775-0800 INFO Starting Controller {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount"}
2024-02-14T08:15:01.775-0800 INFO Starting EventSource {"controller": "nnfnodeecdata", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeECData", "source": "kind source: *v1alpha1.NnfNodeECData"}
2024-02-14T08:15:01.775-0800 INFO Starting Controller {"controller": "nnfnodeecdata", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeECData"}
2024-02-14T08:15:01.775-0800 INFO Starting EventSource {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode", "source": "kind source: *v1alpha1.NnfNode"}
2024-02-14T08:15:01.775-0800 INFO Starting EventSource {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode", "source": "kind source: *v1.Namespace"}
2024-02-14T08:15:01.776-0800 INFO Starting EventSource {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode", "source": "kind source: *v1alpha2.SystemConfiguration"}
2024-02-14T08:15:01.776-0800 INFO Starting Controller {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode"}
2024-02-14T08:15:01.776-0800 INFO Starting EventSource {"controller": "nnfnodeblockstorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeBlockStorage", "source": "kind source: *v1alpha1.NnfNodeBlockStorage"}
2024-02-14T08:15:01.776-0800 INFO Starting Controller {"controller": "nnfnodeblockstorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeBlockStorage"}
2024-02-14T08:15:01.776-0800 INFO Starting EventSource {"controller": "nnfnodestorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeStorage", "source": "kind source: *v1alpha1.NnfNodeStorage"}
2024-02-14T08:15:01.776-0800 INFO Starting Controller {"controller": "nnfnodestorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeStorage"}
2024-02-14T08:15:01.944-0800 INFO Stopping and waiting for non leader election runnables
2024-02-14T08:15:01.944-0800 INFO Stopping and waiting for leader election runnables
2024-02-14T08:15:01.944-0800 INFO Starting workers {"controller": "nnfnodeblockstorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeBlockStorage", "worker count": 1}
2024-02-14T08:15:01.944-0800 INFO Shutdown signal received, waiting for all workers to finish {"controller": "nnfnodeblockstorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeBlockStorage"}
2024-02-14T08:15:01.944-0800 INFO All workers finished {"controller": "nnfnodeblockstorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeBlockStorage"}
2024-02-14T08:15:01.944-0800 INFO Starting workers {"controller": "nnfnodestorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeStorage", "worker count": 128}
2024-02-14T08:15:01.944-0800 INFO Starting workers {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode", "worker count": 1}
2024-02-14T08:15:01.944-0800 INFO Starting workers {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "worker count": 128}
2024-02-14T08:15:01.944-0800 INFO Shutdown signal received, waiting for all workers to finish {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode"}
2024-02-14T08:15:01.944-0800 ERROR controllers.NnfNode get namespace failed {"NnfNode": {"name":"nnf-nlc","namespace":"elcap255"}, "State": "Start", "error": "Timeout: failed waiting for *v1.Namespace Informer to sync"}
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeReconciler).Start
/workspace/internal/controller/nnf_node_controller.go:96
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223
2024-02-14T08:15:01.944-0800 INFO All workers finished {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode"}
2024-02-14T08:15:01.944-0800 ERROR controller-runtime.source.EventHandler failed to get informer from cache {"error": "Timeout: failed waiting for *v1alpha1.NnfNodeBlockStorage Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-02-14T08:15:01.944-0800 ERROR error received after stop sequence was engaged {"error": "Timeout: failed waiting for *v1.Namespace Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:490
2024-02-14T08:15:01.944-0800 INFO Shutdown signal received, waiting for all workers to finish {"controller": "nnfnodestorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeStorage"}
2024-02-14T08:15:01.944-0800 INFO All workers finished {"controller": "nnfnodestorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeStorage"}
2024-02-14T08:15:01.944-0800 ERROR controller-runtime.source.EventHandler failed to get informer from cache {"error": "Timeout: failed waiting for *v1alpha2.ClientMount Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-02-14T08:15:01.944-0800 INFO Starting workers {"controller": "nnfnodeecdata", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeECData", "worker count": 1}
2024-02-14T08:15:01.944-0800 INFO Shutdown signal received, waiting for all workers to finish {"controller": "nnfnodeecdata", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeECData"}
2024-02-14T08:15:01.944-0800 INFO All workers finished {"controller": "nnfnodeecdata", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeECData"}
2024-02-14T08:15:01.944-0800 ERROR controller-runtime.source.EventHandler failed to get informer from cache {"error": "Timeout: failed waiting for *v1.Namespace Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-02-14T08:15:01.944-0800 ERROR controller-runtime.source.EventHandler failed to get informer from cache {"error": "Timeout: failed waiting for *v1alpha1.NnfNode Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-02-14T08:15:01.944-0800 ERROR controller-runtime.source.EventHandler failed to get informer from cache {"error": "Timeout: failed waiting for *v1alpha1.NnfNodeStorage Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-02-14T08:15:01.944-0800 INFO Shutdown signal received, waiting for all workers to finish {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount"}
2024-02-14T08:15:01.944-0800 INFO All workers finished {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount"}
2024-02-14T08:15:01.944-0800 INFO Stopping and waiting for caches
2024-02-14T08:15:01.944-0800 INFO Stopping and waiting for webhooks
2024-02-14T08:15:01.944-0800 INFO Stopping and waiting for HTTP servers
2024-02-14T08:15:01.944-0800 INFO shutting down server {"kind": "health probe", "addr": "[::]:8081"}
2024-02-14T08:15:01.944-0800 INFO controller-runtime.metrics Shutting down metrics server with timeout of 1 minute
2024-02-14T08:15:01.944-0800 INFO Wait completed, proceeding to shutdown the manager
2024-02-14T08:15:01.944-0800 ERROR setup problem running manager {"error": "namespaces \"elcap255\" not found"}
main.main
/workspace/cmd/main.go:146
runtime.main
/usr/local/go/src/runtime/proc.go:250
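For context on the two code paths in the backtraces above: nnf_node_controller.go:96 is inside the NnfNode controller's Start() runnable, and main.go:146 is the mgr.Start() call from main. The following is a simplified, self-contained sketch of that arrangement, assuming the usual controller-runtime patterns; it is not the actual nnf-sos source, and the namespaceChecker name is purely illustrative:

```go
package main

import (
	"context"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// namespaceChecker stands in for the NnfNodeReconciler's Start() runnable. It
// uses the manager's cached client, so the Get() below cannot return until the
// *v1.Namespace informer has synced; when that sync times out (as in the
// backtraces above), the error is returned to the manager.
type namespaceChecker struct {
	client.Client
	namespace string
}

func (c *namespaceChecker) Start(ctx context.Context) error {
	// Read the per-node namespace through the cache (namespace creation is
	// elided here). This Get is roughly where "get namespace failed" originates.
	ns := &corev1.Namespace{}
	return c.Get(ctx, types.NamespacedName{Name: c.namespace}, ns)
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}
	if err := mgr.Add(&namespaceChecker{Client: mgr.GetClient(), namespace: "elcap255"}); err != nil {
		os.Exit(1)
	}
	// Any error returned from a runnable's Start() propagates out of
	// mgr.Start(); the "problem running manager" line is this call failing.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```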
I tried deleting one of the pods. The DaemonSet started a new pod on that node, but it encountered the same problem.
Sampling one of those nodes, the nnf-dm-worker pod is showing a problem that may be related:
2024-02-13T21:15:04.139Z INFO setup starting manager
2024-02-13T21:15:04.139Z INFO controller-runtime.metrics Starting metrics server
2024-02-13T21:15:04.139Z INFO starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-02-13T21:15:04.139Z INFO controller-runtime.metrics Serving metrics server {"bindAddress": ":8080", "secure": false}
2024-02-13T21:15:04.139Z INFO Starting EventSource {"controller": "nnfdatamovement", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfDataMovement", "source": "kind source: *v1alpha1.NnfDataMovement"}
2024-02-13T21:15:04.139Z INFO Starting Controller {"controller": "nnfdatamovement", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfDataMovement"}
2024-02-13T21:15:04.449Z INFO Starting workers {"controller": "nnfdatamovement", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfDataMovement", "worker count": 128}
W0214 15:40:10.343065 13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: watch of *v1alpha1.NnfDataMovement ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 15:40:10.343067 13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: watch of *v1alpha1.NnfDataMovement ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 15:40:10.343091 13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: watch of *v1alpha1.NnfDataMovement ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
The kube-proxy and lustre-csi-node on that node do not show problems.
I was able to confirm that the crt and token are valid. I used kubectl debug to create a debug pod from one of the CrashLoopBackOff pods, on the same node, and started a shell in the nnf-sos image. That image has a kubectl command in it, which I used to access the kube-apiserver the same way the controller would (see client-go's InClusterConfig()), and that works. Then, while still in the debug pod, I started /manager and saw it fail as shown above.
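A minimal, standalone sketch of that in-cluster access path (client-go's InClusterConfig() reading the service-account token and CA) is below; this is not nnf-sos code, and the NODE_NAME environment variable is only an assumption about how the per-node namespace name would be supplied:

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// InClusterConfig reads the service-account token and CA from
	// /var/run/secrets/kubernetes.io/serviceaccount, the same credentials the
	// controller's client uses.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		fmt.Fprintf(os.Stderr, "in-cluster config: %v\n", err)
		os.Exit(1)
	}

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		fmt.Fprintf(os.Stderr, "clientset: %v\n", err)
		os.Exit(1)
	}

	// NODE_NAME is an assumed downward-API variable naming the per-node
	// namespace (e.g. "elcap255").
	name := os.Getenv("NODE_NAME")
	ns, err := clientset.CoreV1().Namespaces().Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		fmt.Fprintf(os.Stderr, "get namespace %q: %v\n", name, err)
		os.Exit(1)
	}
	fmt.Printf("namespace %q exists, phase %s\n", ns.Name, ns.Status.Phase)
}
```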
The errors above are happening with controller-runtime@v0.16.2 and Go 1.19. I tried this with controller-runtime@v0.17.2 (the latest at this moment) and Go 1.21 and got the same results.
Today none of the nodes has an nnf-node-manager controller in CLBO. I have stopped all of the pods in the DaemonSet and restarted them, and there are no CLBOs. I do not know why the problem has gone away; the system is running the original containers, not my debug containers.