chaos-mesh / chaos-mesh

A Chaos Engineering Platform for Kubernetes.
https://chaos-mesh.org
Apache License 2.0

chaos-controller-manager CrashLoopBackOff, reporting failed to get informer from cache and too many open files #4429

Open datawine opened 6 months ago

datawine commented 6 months ago

Bug Report

What version of Kubernetes are you using?

1.29.0

What version of Chaos Mesh are you using?

2.6.3

What did you do? / Minimal Reproducible Example

I tried both curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/install.sh | bash and the Helm installation separately, multiple times (uninstalling the previous attempt before each one, of course). I have also updated the CRDs multiple times with curl -sSL https://mirrors.chaos-mesh.org/v2.6.3/crd.yaml | kubectl replace -f - and helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh -n=chaos-mesh --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock --set dashboard.securityMode=false --version 2.6.3

What did you expect to see?

What did you see instead?

However, I got:

root@ubuntu2204:~# kubectl -n chaos-mesh get po
NAME                                       READY   STATUS             RESTARTS        AGE
chaos-controller-manager-c46c6967d-2m7vk   0/1     CrashLoopBackOff   7 (2m23s ago)   13m
chaos-controller-manager-c46c6967d-9zfkh   0/1     CrashLoopBackOff   7 (2m27s ago)   13m
chaos-controller-manager-c46c6967d-x5q5q   0/1     CrashLoopBackOff   7 (2m49s ago)   13m
chaos-daemon-l79sh                         1/1     Running            0               13m
chaos-daemon-njlkt                         1/1     Running            0               13m
chaos-daemon-pzdlx                         1/1     Running            0               13m
chaos-dashboard-576ddc88c4-54pnr           1/1     Running            0               13m
chaos-dns-server-5b65bd45d5-4x9g9          1/1     Running            0               13m

kubectl describe pods doesn't output much information:

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  17m                  default-scheduler  Successfully assigned chaos-mesh/chaos-controller-manager-c46c6967d-2m7vk to ubuntu2204-worker-1
  Normal   Pulled     15m (x5 over 17m)    kubelet            Container image "ghcr.io/chaos-mesh/chaos-mesh:v2.6.3" already present on machine
  Normal   Created    15m (x5 over 17m)    kubelet            Created container chaos-mesh
  Normal   Started    15m (x5 over 17m)    kubelet            Started container chaos-mesh
  Warning  BackOff    112s (x70 over 17m)  kubelet            Back-off restarting failed container chaos-mesh in pod chaos-controller-manager-c46c6967d-2m7vk_chaos-mesh(cbed0bb0-0279-4e3d-a4dd-a4a827e111f4)

Here is what kubectl logs outputs:

root@ubuntu2204:~# kubectl -n chaos-mesh logs chaos-controller-manager-c46c6967d-2m7vk | tail -n 100
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:235
k8s.io/apimachinery/pkg/util/wait.poll
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:582
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:547
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:136
2024-06-01T05:55:41.535Z    INFO    controller/controller.go:250    All workers finished    {"controller": "schedule-cron", "controllerGroup": "chaos-mesh.org", "controllerKind": "Schedule"}
2024-06-01T05:55:41.535Z    ERROR   controller-runtime.source   source/source.go:148    failed to get informer from cache   {"error": "Timeout: failed waiting for *v1alpha1.Schedule Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:148
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:235
k8s.io/apimachinery/pkg/util/wait.poll
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:582
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:547
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:136
2024-06-01T05:55:41.535Z    ERROR   controller-runtime.source   source/source.go:148    failed to get informer from cache   {"error": "Timeout: failed waiting for *v1alpha1.Schedule Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:148
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:235
k8s.io/apimachinery/pkg/util/wait.poll
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:582
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:547
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:136
2024-06-01T05:55:41.535Z    ERROR   controller-runtime.source   source/source.go:148    failed to get informer from cache   {"error": "Timeout: failed waiting for *v1alpha1.Workflow Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:148
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:235
k8s.io/apimachinery/pkg/util/wait.poll
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:582
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:547
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:136
2024-06-01T05:55:41.535Z    ERROR   controller-runtime.source   source/source.go:148    failed to get informer from cache   {"error": "Timeout: failed waiting for *v1alpha1.AWSChaos Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:148
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:235
k8s.io/apimachinery/pkg/util/wait.poll
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:582
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:547
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:136
2024-06-01T05:55:41.535Z    ERROR   controller-runtime.source   source/source.go:148    failed to get informer from cache   {"error": "Timeout: failed waiting for *v1alpha1.PhysicalMachineChaos Informer to sync"}
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:148
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:235
k8s.io/apimachinery/pkg/util/wait.poll
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:582
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
    /tmp/go/pkg/mod/k8s.io/apimachinery@v0.26.1/pkg/util/wait/wait.go:547
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
    /tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/source/source.go:136
2024-06-01T05:55:41.535Z    INFO    controller/controller.go:228    Starting workers    {"controller": "stresschaos-remotechaos", "controllerGroup": "chaos-mesh.org", "controllerKind": "StressChaos", "worker count": 1}
2024-06-01T05:55:41.535Z    INFO    controller/controller.go:228    Starting workers    {"controller": "jvmchaos-remotechaos", "controllerGroup": "chaos-mesh.org", "controllerKind": "JVMChaos", "worker count": 1}
2024-06-01T05:55:41.535Z    INFO    controller/controller.go:248    Shutdown signal received, waiting for all workers to finish {"controller": "stresschaos-remotechaos", "controllerGroup": "chaos-mesh.org", "controllerKind": "StressChaos"}
2024-06-01T05:55:41.535Z    INFO    controller/controller.go:248    Shutdown signal received, waiting for all workers to finish {"controller": "jvmchaos-remotechaos", "controllerGroup": "chaos-mesh.org", "controllerKind": "JVMChaos"}
2024-06-01T05:55:41.535Z    INFO    controller/controller.go:250    All workers finished    {"controller": "jvmchaos-remotechaos", "controllerGroup": "chaos-mesh.org", "controllerKind": "JVMChaos"}
2024-06-01T05:55:41.535Z    INFO    controller/controller.go:248    Shutdown signal received, waiting for all workers to finish {"controller": "workflow-abort-workflow-reconciler", "controllerGroup": "chaos-mesh.org", "controllerKind": "Workflow"}
2024-06-01T05:55:41.535Z    INFO    controller/controller.go:250    All workers finished    {"controller": "workflow-abort-workflow-reconciler", "controllerGroup": "chaos-mesh.org", "controllerKind": "Workflow"}
2024-06-01T05:55:41.535Z    INFO    controller/controller.go:250    All workers finished    {"controller": "stresschaos-remotechaos", "controllerGroup": "chaos-mesh.org", "controllerKind": "StressChaos"}
2024-06-01T05:55:41.535Z    INFO    manager/internal.go:586 Stopping and waiting for caches
2024-06-01T05:55:41.536Z    INFO    manager/internal.go:590 Stopping and waiting for webhooks
2024-06-01T05:55:41.536Z    INFO    manager/internal.go:594 Wait completed, proceeding to shutdown the manager
E0601 05:55:41.536309       1 leaderelection.go:330] error retrieving resource lock chaos-mesh/chaos-mesh: Get "https://10.96.0.1:443/api/v1/namespaces/chaos-mesh/configmaps/chaos-mesh": context canceled
2024-06-01T05:55:41.536Z    ERROR   setup   chaos-controller-manager/main.go:208    unable to start manager {"error": "too many open files"}
main.Run
    /home/runner/work/chaos-mesh/chaos-mesh/cmd/chaos-controller-manager/main.go:208
reflect.Value.call
    /usr/local/go/src/reflect/value.go:584
reflect.Value.Call
    /usr/local/go/src/reflect/value.go:368
go.uber.org/dig.defaultInvoker
    /tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/container.go:238
go.uber.org/dig.(*Scope).Invoke
    /tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/invoke.go:108
go.uber.org/dig.(*Container).Invoke
    /tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/invoke.go:50
go.uber.org/fx.runInvoke
    /tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/invoke.go:108
go.uber.org/fx.(*module).executeInvoke
    /tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/module.go:246
go.uber.org/fx.(*module).executeInvokes
    /tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/module.go:232
go.uber.org/fx.New
    /tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/app.go:502
main.main
    /home/runner/work/chaos-mesh/chaos-mesh/cmd/chaos-controller-manager/main.go:80
runtime.main
    /usr/local/go/src/runtime/proc.go:250
root@ubuntu2204:~#
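
The log above repeats the same stack trace once per informer, which makes it hard to see which kinds actually timed out. A minimal sketch of a filter that condenses it, assuming the error messages keep the "failed waiting for *<Kind> Informer to sync" wording shown above (the function name summarize_informer_timeouts is made up for illustration):

```shell
# Hypothetical helper: reads controller-manager logs on stdin and prints
# each informer kind that timed out, with the number of occurrences.
# It relies only on the message format visible in the log above.
summarize_informer_timeouts() {
  grep -o 'failed waiting for \*[A-Za-z0-9.]* Informer to sync' \
    | sed -e 's/^failed waiting for \*//' -e 's/ Informer to sync$//' \
    | sort \
    | uniq -c
}

# Example usage against a live pod (pod name taken from the report above):
#   kubectl -n chaos-mesh logs chaos-controller-manager-c46c6967d-2m7vk \
#     | summarize_informer_timeouts
```

If every Chaos Mesh kind shows up in the summary, that would suggest a problem common to all informers rather than one specific to a single CRD.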

Here is what ulimit -n outputs:

root@ubuntu2204:~# ulimit -n
65535

I believe the "too many open files" error is just a superficial symptom and that "failed to get informer from cache" is the real problem. Can anyone please help?
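
One thing that may be worth ruling out: on Linux, "too many open files" can also be returned when a process hits the kernel's per-user inotify limits, which are separate from the ulimit -n value shown above. Whether that applies here is an assumption, not something this report confirms. A minimal sketch to print both sets of limits on a node (the function name is hypothetical; the /proc/sys/fs/inotify paths are standard on Linux):

```shell
# Hypothetical helper: prints the shell's open-file limit alongside the
# kernel inotify limits. Linux only; prints "unavailable" for a key whose
# /proc entry cannot be read.
show_fd_and_inotify_limits() {
  echo "nofile=$(ulimit -n)"
  for key in max_user_instances max_user_watches; do
    path="/proc/sys/fs/inotify/$key"
    if [ -r "$path" ]; then
      echo "inotify_$key=$(cat "$path")"
    else
      echo "inotify_$key=unavailable"
    fi
  done
}

show_fd_and_inotify_limits
```

This needs to run on the worker node that hosts the failing pod (ubuntu2204-worker-1 in the events above), since the inotify limits are per-node kernel settings.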

AndrewCi commented 5 months ago

@datawine I am also facing this same issue! Were you able to resolve it? I read in the docs that Chaos Mesh only supports Kubernetes v1.28 and below, and I saw that you're on v1.29.

I initially attempted the install on v1.29, then downgraded to v1.28.5, and I'm still facing the same error.

Any help from the community would be greatly appreciated.

I'm installing this on AKS and I followed all of the prerequisites described in the AKS docs (e.g. local accounts enabled), with no luck.

AndrewCi commented 5 months ago

This resolved my issue (installing via the CMD prompt rather than Git Bash on Windows):

https://github.com/chaos-mesh/chaos-mesh/discussions/3954

It may be worth adding this to the docs in case anyone else is installing via Git Bash.