Dynatrace / dynatrace-oneagent-operator

Kubernetes/Openshift Operator for managing Dynatrace OneAgent deployments
Apache License 2.0

Webhook pod leadership lock mismatch #352

Closed BrendanGalloway closed 3 years ago

BrendanGalloway commented 3 years ago

Running dynatrace-oneagent-operator v0.8.2, I have a situation where the webhook pods cannot complete startup due to a mismatch in leadership locks between the bootstrapper and the webhook. There are two pods running - on pod A, the bootstrapper holds the leadership lock and has completed the bootstrap process, but on pod B it is the webhook container that holds the leadership lock. The webhook on pod A is stuck waiting to become leader, while the webhook on pod B is crashlooping - the certificates it requires are never created because pod B's bootstrapper is stuck waiting for its lock.

lrgar commented 3 years ago

Hi. The bootstrapper and the webhook containers (should) use different leader locks. Can you provide logs from both containers so that I can take a look?

Also, was this caused by a new installation, or an upgrade?
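
For context: operator-sdk v0.17.0 (the version shown in the logs below) does "leader for life" election via its leader package, where each container calls leader.Become with its own lock name; the call creates a ConfigMap owned by the calling pod and blocks until that pod owns it. Here is a minimal sketch of that startup pattern, using an illustrative lock name rather than the operator's actual one:

```go
// Minimal sketch of operator-sdk v0.17.0's leader-for-life election
// (github.com/operator-framework/operator-sdk/pkg/leader). The lock name
// below is illustrative; the bootstrapper and webhook containers each use
// their own lock name, so the two locks can end up held by containers on
// different pods.
package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	ctx := context.TODO()

	// Become blocks here: it tries to create a ConfigMap owned by this pod.
	// If the ConfigMap already exists and is owned by another live pod, it
	// keeps retrying, producing the "Not the leader. Waiting." log lines.
	if err := leader.Become(ctx, "example-webhook-lock"); err != nil {
		log.Fatal(err)
	}

	// Only the leader gets past this point and continues startup; the
	// bootstrapper and webhook containers each do this independently.
	log.Println("became leader, continuing startup")
}
```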

BrendanGalloway commented 3 years ago

After posting the issue I realised I could work around it by scaling the deployment down to one replica and then back up to two, which fixed the problem. So unfortunately I only have one half of the logs to hand now:

logs dynatrace-oneagent-webhook-6dbb9ccc85-8ktmw webhook

{"level":"info","ts":"2020-10-23T08:49:25.206Z","logger":"cmd","msg":"Go Version: go1.14.9"}
{"level":"info","ts":"2020-10-23T08:49:25.207Z","logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":"2020-10-23T08:49:25.207Z","logger":"cmd","msg":"Version of operator-sdk: v0.17.0"}
{"level":"info","ts":"2020-10-23T08:49:25.207Z","logger":"cmd","msg":"Version of dynatrace-oneagent-operator: v0.8.2"}
{"level":"info","ts":"2020-10-23T08:49:25.207Z","logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":"2020-10-23T08:49:26.702Z","logger":"leader","msg":"Found existing lock","LockOwner":"dynatrace-oneagent-webhook-6dbb9ccc85-6472b"}
{"level":"info","ts":"2020-10-23T08:49:26.731Z","logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":"2020-10-23T08:49:27.870Z","logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":"2020-10-23T08:49:30.266Z","logger":"leader","msg":"Not the leader. Waiting."}

logs dynatrace-oneagent-webhook-6dbb9ccc85-8ktmw bootstrapper

{"level":"info","ts":"2020-10-23T08:40:42.414Z","logger":"cmd","msg":"Go Version: go1.14.9"}
{"level":"info","ts":"2020-10-23T08:40:42.414Z","logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":"2020-10-23T08:40:42.414Z","logger":"cmd","msg":"Version of operator-sdk: v0.17.0"}
{"level":"info","ts":"2020-10-23T08:40:42.414Z","logger":"cmd","msg":"Version of dynatrace-oneagent-operator: v0.8.2"}
{"level":"info","ts":"2020-10-23T08:40:42.417Z","logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":"2020-10-23T08:40:43.715Z","logger":"leader","msg":"Found existing lock with my name. I was likely restarted."}
{"level":"info","ts":"2020-10-23T08:40:43.715Z","logger":"leader","msg":"Continuing as the leader."}
{"level":"info","ts":"2020-10-23T08:40:44.817Z","logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":"2020-10-23T08:40:44.817Z","logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":"2020-10-23T08:40:44.818Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"webhook-bootstrapper-controller","source":"channel source: 0xc00055b450"}
{"level":"info","ts":"2020-10-23T08:40:44.818Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"webhook-bootstrapper-controller"}
{"level":"info","ts":"2020-10-23T08:40:44.818Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"webhook-bootstrapper-controller","worker count":1}
{"level":"info","ts":"2020-10-23T08:40:54.818Z","logger":"webhook.controller","msg":"reconciling webhook","namespace":"dynatrace","name":"dynatrace-oneagent-webhook"}
{"level":"info","ts":"2020-10-23T08:40:54.818Z","logger":"webhook.controller","msg":"Reconciling certificates..."}
{"level":"info","ts":"2020-10-23T08:40:54.922Z","logger":"webhook.controller","msg":"Reconciling Service..."}
{"level":"info","ts":"2020-10-23T08:40:55.024Z","logger":"webhook.controller","msg":"Reconciling MutatingWebhookConfiguration..."}

The logs for dynatrace-oneagent-webhook-6dbb9ccc85-6472b showed the bootstrapper detecting that 8ktmw was the leader, while the webhook container found a lock with its own name; it then looped through "Waiting for certificates to be available." until it crashed.

I'm not sure of the history of the deployment that led to this situation, but if both pods start at the same time, and both the bootstrapper and the webhook run through the same code up to the leadership check, then a race condition seems likely. Since the locks are different, there's no guarantee that both will be claimed by containers in the same pod.
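
One way to confirm such a mismatch (a hedged sketch, not a tool shipped with the operator) would be to read the lock ConfigMaps and print which pod owns each one, since a leader-for-life lock is a ConfigMap whose owner reference points at the leader pod. The lock names and the "dynatrace" namespace below are assumptions for illustration; the real names can be listed with kubectl get configmaps in the operator namespace.

```go
// Hedged sketch: print the pod that owns each leader-lock ConfigMap, to check
// whether the bootstrapper lock and the webhook lock ended up on different
// pods. Lock names and the "dynatrace" namespace are illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical lock names; list the ConfigMaps in the operator namespace
	// to find the real ones.
	for _, lock := range []string{"bootstrapper-lock", "webhook-lock"} {
		cm, err := client.CoreV1().ConfigMaps("dynatrace").Get(context.TODO(), lock, metav1.GetOptions{})
		if err != nil {
			log.Printf("%s: %v", lock, err)
			continue
		}
		for _, owner := range cm.OwnerReferences {
			// For leader-for-life locks the owner is the leader pod.
			fmt.Printf("%s is held by %s %s\n", lock, owner.Kind, owner.Name)
		}
	}
}
```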

lrgar commented 3 years ago

scaling the deployment down to one replica and then back up to two

That is in fact the reason: the current implementation expects to run with a single replica. It's mostly an issue with the bootstrapper (which generates and renews the SSL certificates for the webhook's HTTPS server), which would then need to coordinate with the other instances.
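
As a purely illustrative sketch of one possible direction for standby replicas (not what the operator currently does), controller-runtime's lease-based leader election lets extra replicas start, wait on the lease, and take over when the leader goes away instead of crashlooping. The election ID below is a made-up example:

```go
// Hedged sketch, not the operator's current implementation: lease-based
// leader election via controller-runtime, allowing hot-standby replicas.
// The election ID and namespace are illustrative assumptions.
package main

import (
	"log"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/manager/signals"
)

func main() {
	cfg, err := config.GetConfig()
	if err != nil {
		log.Fatal(err)
	}

	mgr, err := manager.New(cfg, manager.Options{
		LeaderElection:          true,                            // only one replica reconciles at a time
		LeaderElectionID:        "oneagent-webhook-bootstrapper", // illustrative election ID
		LeaderElectionNamespace: "dynatrace",                     // operator namespace (assumed)
	})
	if err != nil {
		log.Fatal(err)
	}

	// A replica that does not hold the lease waits inside Start and only
	// begins reconciling once it acquires the lease, so a standby replica
	// can take over if the leader pod dies.
	if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```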

Is there a reason to have multiple replicas? I presume high availability, but I'd like to confirm.

BrendanGalloway commented 3 years ago

Hi - it is indeed for availability. We can reduce the replicas to 1 for now, but we would definitely prefer to have two available.

lrgar commented 3 years ago

Thank you for the feedback. I've created an internal ticket to look at it.