NearNodeFlash / NearNodeFlash.github.io


nnf-node-manager CrashLoopBackOff: failed waiting for Informer to sync #129

Open · roehrich-hpe opened this issue 4 months ago

roehrich-hpe commented 4 months ago

We have a subset of nnf-node-manager pods in CrashLoopBackOff. All show the same messages in their logs. This is nnf-deploy v0.0.8.

All controllers in nnf-node-manager are failing. The NnfNode controller's Start() method is reporting the following before it can create the namespace:

2024-02-14T07:50:55.310-0800    ERROR   controllers.NnfNode     get namespace failed    {"NnfNode": {"name":"nnf-nlc","namespace":"elcap442"}, "State": "Start", "error": "Timeout: failed waiting for *v1.Namespace Informer to sync"}
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeReconciler).Start
        /workspace/internal/controller/nnf_node_controller.go:96
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223

All other controllers in the manager are showing that same timeout waiting for Informer to sync. Here's one for NnfNodeStorage:

2024-02-14T07:50:55.311-0800    ERROR   controller-runtime.source.EventHandler  failed to get informer from cache       {"error": "Timeout: failed waiting for *v1alpha1.NnfNodeStorage Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56

The backtraces on all of the controllers are identical.
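
For context on where that timeout comes from: Start() runs as a manager runnable and reads the Namespace through the manager's cache-backed client, and that read cannot complete until the cache's *v1.Namespace informer has synced. The same sync dependency exists for every controller's event sources, which is why they all report the identical timeout. Below is a minimal sketch of the pattern, with hypothetical names; it is not the actual nnf_node_controller.go code, just the general controller-runtime shape.

// Sketch only: a hypothetical runnable reading a Namespace through the
// manager's cache-backed client. If the *v1.Namespace informer cannot sync,
// Get() returns "Timeout: failed waiting for *v1.Namespace Informer to sync".
package controller

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

type nodeStarter struct {
    client.Client // the manager's cache-backed client
    namespace     string
}

// Start is registered with mgr.Add() and runs when the manager starts.
func (s *nodeStarter) Start(ctx context.Context) error {
    ns := &corev1.Namespace{}
    if err := s.Get(ctx, types.NamespacedName{Name: s.namespace}, ns); err != nil {
        // This is where "get namespace failed" would be reported.
        return fmt.Errorf("get namespace failed: %w", err)
    }
    return nil
}

One experiment that takes the informer out of the picture for a startup read like this is mgr.GetAPIReader(), which queries the API server directly instead of the cache; whether that is appropriate here depends on how the controller is wired, so it is only a possible diagnostic, not a fix.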

Finally, the last messages in the log:

2024-02-14T07:50:55.312-0800    INFO    shutting down server    {"kind": "health probe", "addr": "[::]:8081"}
2024-02-14T07:50:55.312-0800    INFO    Wait completed, proceeding to shutdown the manager
2024-02-14T07:50:55.312-0800    ERROR   setup   problem running manager {"error": "namespaces \"elcap442\" not found"}
main.main
        /workspace/cmd/main.go:146
runtime.main
        /usr/local/go/src/runtime/proc.go:250

That's the mgr.Start() call from main.
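
For reference, the standard kubebuilder-style scaffold around that call looks roughly like the sketch below (the actual cmd/main.go surely differs); any error returned from a runnable's Start() propagates out of mgr.Start() and is logged as "problem running manager".

// Sketch of a typical kubebuilder-style entry point, not the actual cmd/main.go.
package main

import (
    "os"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
    ctrl.SetLogger(zap.New())
    setupLog := ctrl.Log.WithName("setup")

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
    if err != nil {
        setupLog.Error(err, "unable to create manager")
        os.Exit(1)
    }

    // ... controllers and runnables are registered with mgr here ...

    setupLog.Info("starting manager")
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        // The error from the failed runnable surfaces here.
        setupLog.Error(err, "problem running manager")
        os.Exit(1)
    }
}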

roehrich-hpe commented 4 months ago

A complete log from one of the pods:

2024-02-14T08:15:01.773-0800    INFO    setup   GOMAXPROCS  {"value": 128}
2024-02-14T08:15:01.774-0800    INFO    setup   starting manager
2024-02-14T08:15:01.774-0800    INFO    controller-runtime.metrics  Starting metrics server
2024-02-14T08:15:01.774-0800    INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-02-14T08:15:01.774-0800    INFO    controller-runtime.metrics  Serving metrics server  {"bindAddress": ":8080", "secure": false}
2024-02-14T08:15:01.775-0800    INFO    Starting EventSource    {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "source": "kind source: *v1alpha2.ClientMount"}
2024-02-14T08:15:01.775-0800    INFO    Starting Controller {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount"}
2024-02-14T08:15:01.775-0800    INFO    Starting EventSource    {"controller": "nnfnodeecdata", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeECData", "source": "kind source: *v1alpha1.NnfNodeECData"}
2024-02-14T08:15:01.775-0800    INFO    Starting Controller {"controller": "nnfnodeecdata", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeECData"}
2024-02-14T08:15:01.775-0800    INFO    Starting EventSource    {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode", "source": "kind source: *v1alpha1.NnfNode"}
2024-02-14T08:15:01.775-0800    INFO    Starting EventSource    {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode", "source": "kind source: *v1.Namespace"}
2024-02-14T08:15:01.776-0800    INFO    Starting EventSource    {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode", "source": "kind source: *v1alpha2.SystemConfiguration"}
2024-02-14T08:15:01.776-0800    INFO    Starting Controller {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode"}
2024-02-14T08:15:01.776-0800    INFO    Starting EventSource    {"controller": "nnfnodeblockstorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeBlockStorage", "source": "kind source: *v1alpha1.NnfNodeBlockStorage"}
2024-02-14T08:15:01.776-0800    INFO    Starting Controller {"controller": "nnfnodeblockstorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeBlockStorage"}
2024-02-14T08:15:01.776-0800    INFO    Starting EventSource    {"controller": "nnfnodestorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeStorage", "source": "kind source: *v1alpha1.NnfNodeStorage"}
2024-02-14T08:15:01.776-0800    INFO    Starting Controller {"controller": "nnfnodestorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeStorage"}
2024-02-14T08:15:01.944-0800    INFO    Stopping and waiting for non leader election runnables
2024-02-14T08:15:01.944-0800    INFO    Stopping and waiting for leader election runnables
2024-02-14T08:15:01.944-0800    INFO    Starting workers    {"controller": "nnfnodeblockstorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeBlockStorage", "worker count": 1}
2024-02-14T08:15:01.944-0800    INFO    Shutdown signal received, waiting for all workers to finish {"controller": "nnfnodeblockstorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeBlockStorage"}
2024-02-14T08:15:01.944-0800    INFO    All workers finished    {"controller": "nnfnodeblockstorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeBlockStorage"}
2024-02-14T08:15:01.944-0800    INFO    Starting workers    {"controller": "nnfnodestorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeStorage", "worker count": 128}
2024-02-14T08:15:01.944-0800    INFO    Starting workers    {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode", "worker count": 1}
2024-02-14T08:15:01.944-0800    INFO    Starting workers    {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount", "worker count": 128}
2024-02-14T08:15:01.944-0800    INFO    Shutdown signal received, waiting for all workers to finish {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode"}
2024-02-14T08:15:01.944-0800    ERROR   controllers.NnfNode get namespace failed    {"NnfNode": {"name":"nnf-nlc","namespace":"elcap255"}, "State": "Start", "error": "Timeout: failed waiting for *v1.Namespace Informer to sync"}
github.com/NearNodeFlash/nnf-sos/internal/controller.(*NnfNodeReconciler).Start
    /workspace/internal/controller/nnf_node_controller.go:96
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223
2024-02-14T08:15:01.944-0800    INFO    All workers finished    {"controller": "nnfnode", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNode"}
2024-02-14T08:15:01.944-0800    ERROR   controller-runtime.source.EventHandler  failed to get informer from cache   {"error": "Timeout: failed waiting for *v1alpha1.NnfNodeBlockStorage Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-02-14T08:15:01.944-0800    ERROR   error received after stop sequence was engaged  {"error": "Timeout: failed waiting for *v1.Namespace Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:490
2024-02-14T08:15:01.944-0800    INFO    Shutdown signal received, waiting for all workers to finish {"controller": "nnfnodestorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeStorage"}
2024-02-14T08:15:01.944-0800    INFO    All workers finished    {"controller": "nnfnodestorage", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeStorage"}
2024-02-14T08:15:01.944-0800    ERROR   controller-runtime.source.EventHandler  failed to get informer from cache   {"error": "Timeout: failed waiting for *v1alpha2.ClientMount Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-02-14T08:15:01.944-0800    INFO    Starting workers    {"controller": "nnfnodeecdata", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeECData", "worker count": 1}
2024-02-14T08:15:01.944-0800    INFO    Shutdown signal received, waiting for all workers to finish {"controller": "nnfnodeecdata", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeECData"}
2024-02-14T08:15:01.944-0800    INFO    All workers finished    {"controller": "nnfnodeecdata", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfNodeECData"}
2024-02-14T08:15:01.944-0800    ERROR   controller-runtime.source.EventHandler  failed to get informer from cache   {"error": "Timeout: failed waiting for *v1.Namespace Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-02-14T08:15:01.944-0800    ERROR   controller-runtime.source.EventHandler  failed to get informer from cache   {"error": "Timeout: failed waiting for *v1alpha1.NnfNode Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-02-14T08:15:01.944-0800    ERROR   controller-runtime.source.EventHandler  failed to get informer from cache   {"error": "Timeout: failed waiting for *v1alpha1.NnfNodeStorage Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:68
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:49
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:50
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-02-14T08:15:01.944-0800    INFO    Shutdown signal received, waiting for all workers to finish {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount"}
2024-02-14T08:15:01.944-0800    INFO    All workers finished    {"controller": "clientmount", "controllerGroup": "dataworkflowservices.github.io", "controllerKind": "ClientMount"}
2024-02-14T08:15:01.944-0800    INFO    Stopping and waiting for caches
2024-02-14T08:15:01.944-0800    INFO    Stopping and waiting for webhooks
2024-02-14T08:15:01.944-0800    INFO    Stopping and waiting for HTTP servers
2024-02-14T08:15:01.944-0800    INFO    shutting down server    {"kind": "health probe", "addr": "[::]:8081"}
2024-02-14T08:15:01.944-0800    INFO    controller-runtime.metrics  Shutting down metrics server with timeout of 1 minute
2024-02-14T08:15:01.944-0800    INFO    Wait completed, proceeding to shutdown the manager
2024-02-14T08:15:01.944-0800    ERROR   setup   problem running manager {"error": "namespaces \"elcap255\" not found"}
main.main
    /workspace/cmd/main.go:146
runtime.main
    /usr/local/go/src/runtime/proc.go:250

roehrich-hpe commented 4 months ago

I tried deleting one of the pods. The DaemonSet started a new pod on that node, but it encountered the same problem.

roehrich-hpe commented 4 months ago

Sampling one of those nodes, the nnf-dm-worker pod is showing a problem that may be related:

2024-02-13T21:15:04.139Z    INFO    setup   starting manager
2024-02-13T21:15:04.139Z    INFO    controller-runtime.metrics  Starting metrics server
2024-02-13T21:15:04.139Z    INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-02-13T21:15:04.139Z    INFO    controller-runtime.metrics  Serving metrics server  {"bindAddress": ":8080", "secure": false}
2024-02-13T21:15:04.139Z    INFO    Starting EventSource    {"controller": "nnfdatamovement", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfDataMovement", "source": "kind source: *v1alpha1.NnfDataMovement"}
2024-02-13T21:15:04.139Z    INFO    Starting Controller {"controller": "nnfdatamovement", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfDataMovement"}
2024-02-13T21:15:04.449Z    INFO    Starting workers    {"controller": "nnfdatamovement", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfDataMovement", "worker count": 128}
W0214 15:40:10.343065      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: watch of *v1alpha1.NnfDataMovement ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 15:40:10.343067      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: watch of *v1alpha1.NnfDataMovement ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0214 15:40:10.343091      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:208: watch of *v1alpha1.NnfDataMovement ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

The kube-proxy and lustre-csi-node pods on that node do not show any problems.

roehrich-hpe commented 4 months ago

I was able to confirm that the service account cert and token are valid. I used kubectl debug to create a debug pod from one of the CrashLoopBackOff pods, on the same node, and started a shell in the nnf-sos image. That image includes a kubectl command, which I used to access the kube-apiserver the same way the controller does (see client-go's InClusterConfig()), and that works. Then, while still in the debug pod, I started /manager and saw it fail as shown above.
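
For anyone repeating that check programmatically rather than with kubectl, the sketch below does the equivalent through client-go: rest.InClusterConfig() reads the service account token and CA cert mounted at /var/run/secrets/kubernetes.io/serviceaccount, which is the same path the controller takes, and the Get() goes straight to the API server with no informer cache involved. The namespace name is only an example taken from the logs above.

// Sketch of an in-cluster API check equivalent to the kubectl test described above.
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // Uses the pod's mounted service account token and CA cert.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }
    // Direct GET against the API server; "elcap442" is one of the namespaces
    // named in the logs above, substitute as needed.
    ns, err := clientset.CoreV1().Namespaces().Get(context.TODO(), "elcap442", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Println("found namespace:", ns.Name)
}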

roehrich-hpe commented 4 months ago

The errors above are happening with controller-runtime@v0.16.2 and Go 1.19.

I tried this with controller-runtime@v0.17.2 (the latest at the moment) and Go 1.21 and got the same results.

roehrich-hpe commented 4 months ago

Today none of the nodes has an nnf-node-manager pod in CrashLoopBackOff. I stopped all of the pods in the DaemonSet and restarted them, and there are no CrashLoopBackOffs. I do not know why the problem went away; the system is running the original containers, not my debug containers.