couchbase-partners / helm-charts

Helm charts for deployed couchbase services
Apache License 2.0

Operator failure leads into `/pools/default 404 Object Not Found` #109

Open · jloehel opened this issue 1 year ago

jloehel commented 1 year ago

Couchbase operator version:

    - name: couchbase-operator
      version: 2.32.2
      repository: https://couchbase-partners.github.io/helm-charts/
      condition: couchbase-operator.enabled
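
For context, the operator is pulled in as a subchart of our own chart and deployed roughly like this (the chart path ./saferwall and the release name are placeholders; the namespace matches the error messages below):

    # Add the Couchbase chart repo and resolve the subchart dependency:
    helm repo add couchbase https://couchbase-partners.github.io/helm-charts/
    helm repo update
    helm dependency update ./saferwall
    # Install the parent chart with the operator subchart enabled:
    helm install saferwall ./saferwall --namespace saferwall --create-namespace \
      --set couchbase-operator.enabled=true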

I am running into issues deploying it on minikube/kind. The pod for the cluster gets created, and after some time the operator fails with:

{"level":"error","ts":1684787298.9081988,"msg":"Failed to update lock: resource name may not be empty\n","stacktrace":"k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1.1\n\tk8s.io/client-go@v0.23.2/tools/leaderelection/leaderelection.go:272\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/apimachinery@v0.23.2/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/apimachinery@v0.23.2/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.poll\n\tk8s.io/apimachinery@v0.23.2/pkg/util/wait/wait.go:580\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext\n\tk8s.io/apimachinery@v0.23.2/pkg/util/wait/wait.go:545\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntil\n\tk8s.io/apimachinery@v0.23.2/pkg/util/wait/wait.go:536\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1\n\tk8s.io/client-go@v0.23.2/tools/leaderelection/leaderelection.go:271\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\tk8s.io/apimachinery@v0.23.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\tk8s.io/apimachinery@v0.23.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\tk8s.io/apimachinery@v0.23.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\tk8s.io/apimachinery@v0.23.2/pkg/util/wait/wait.go:90\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).renew\n\tk8s.io/client-go@v0.23.2/tools/leaderelection/leaderelection.go:268\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\tk8s.io/client-go@v0.23.2/tools/leaderelection/leaderelection.go:212\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElection.func3\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/manager/internal.go:642"}
{"level":"info","ts":1684787298.908323,"msg":"failed to renew lease default/couchbase-operator: timed out waiting for the condition\n"}
{"level":"error","ts":1684787298.908393,"logger":"main","msg":"Error starting resource manager","error":"leader election lost","stacktrace":"main.main\n\tgithub.com/ ...

I suspect the kube-scheduler is at fault, but I haven't figured it out yet. Once the operator is up again, I receive the following error message:

ERR ts=1684780626.1920617 logger=cluster msg=Failed to update members cluster=saferwall/couchbase-cluster error=unexpected status code: request failed GET http://couchbase-cluster-0000.couchbase-cluster.saferwall.svc:8091/pools/default 404 Object Not Found: "unknown pool" stacktrace=github.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile
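
To rule the kube-scheduler in or out and to see what the operator has recorded, something along these lines could help (cluster name and namespace are taken from the error above):

    # Events in the operator namespace (was the operator pod evicted or rescheduled?):
    kubectl get events -n default --sort-by=.lastTimestamp
    # Status the operator has written to the CouchbaseCluster resource:
    kubectl describe couchbasecluster couchbase-cluster -n saferwall
    # State of the pod the reconcile keeps failing against:
    kubectl describe pod couchbase-cluster-0000 -n saferwall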

If I understand it correctly, there is no pool yet; the pool gets created together with the cluster. If I create the cluster manually, there is a default pool, but then the UUIDs no longer match. Curl output from couchbase-cluster-0000:

couchbase@couchbase-cluster-0000:/$ curl http://127.0.0.1:8091/pools/default
"unknown pool"
couchbase@couchbase-cluster-0000:/$ curl http://127.0.0.1:8091/pools/       
{
  "isAdminCreds": true,
  "isROAdminCreds": false,
  "isEnterprise": true,
  "allowedServices": [
    "kv",
    "n1ql",
    "index",
    "fts",
    "cbas",
    "eventing",
    "backup"
  ],
  "isDeveloperPreview": false,
  "packageVariant": "ubuntu20.04/docker",
  "pools": [],
  "settings": [],
  "uuid": [],
  "implementationVersion": "7.0.2-6703-enterprise",
  "componentsVersion": {
    "stdlib": "3.12.1",
    "ale": "0.0.0",
    "inets": "7.1.3.3",
    "lhttpc": "1.3.0",
    "sasl": "3.4.2",
    "crypto": "4.6.5.2",
    "ssl": "9.6.2.3",
    "ns_server": "7.0.2-6703-enterprise",
    "public_key": "1.7.2",
    "os_mon": "2.5.1.1",
    "asn1": "5.0.12",
    "kernel": "6.5.2.1",
    "chronicle": "0.0.1"
  }
}
couchbase@couchbase-cluster-0000:/$
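
Since /pools shows an empty pools list and an empty uuid, the node looks uninitialised rather than corrupted. One idea would be to delete that pod and let the operator re-create and re-initialise it, roughly like this (names are from my setup, the operator deployment name is a guess, and this is only acceptable while the cluster holds no data):

    # Delete the uninitialised node pod so the operator re-creates it:
    kubectl delete pod couchbase-cluster-0000 -n saferwall
    # Watch whether the new pod comes up and gets initialised:
    kubectl get pods -n saferwall -w
    # Follow the operator logs while it reconciles (deployment name is a guess):
    kubectl logs -n default deploy/couchbase-operator -f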

Is there a safe way to recover from the first failure? If you need more info, please ping me.

bisonlou commented 1 month ago

I am also getting the same error: request failed: unexpected status code GET https://oaf-couchbase-0000.oaf-couchbase.couchbase.svc:18091/pools/default 404 Object Not Found: "unknown pool"