1Password / onepassword-operator

The 1Password Connect Kubernetes Operator provides the ability to integrate Kubernetes Secrets with 1Password. The operator also handles autorestarting deployments when 1Password items are updated.
https://developer.1password.com/docs/connect/
MIT License

Deadlock when onepassword-connect-operator pod enters state "completed" #116

Closed chrissachs closed 10 months ago

chrissachs commented 2 years ago

Your environment

Operator Version: 1.5.1 (helm chart connect-1.7.1)

Connect Server Version: 1.5.0

Kubernetes Version: v1.22.8-gke.201

What happened?

The onepassword-connect-operator stopped syncing the secrets.

There was one pod in state "completed" (pod name onepassword-connect-operator-bf95d87f6-knp4k, with no notable logs before the shutdown) and a running pod that only logged "not the leader". A newly started pod logged the following on startup:

 {"level":"info","ts":1653895690.521165,"logger":"leader","msg":"Found existing lock","LockOwner":"onepassword-connect-operator-bf95d87f6-knp4k"} 

What did you expect to happen?

The lock of the "completed" pod to be freed.

Steps to reproduce

Not sure what caused the initial pod to go into state "completed". I had already deleted it, and the logs don't show anything unusual.

mdnfiras commented 2 years ago

The same happened to us. The old pod was shut down, most likely because the cluster scaled down:

apiVersion: v1
kind: Pod
metadata:
  name: onepassword-connect-operator-77cb97d6f6-w7s2x
  namespace: 1password
status:
  phase: Failed
  message: Pod was terminated in response to imminent node shutdown.
  reason: Terminated
  startTime: '2022-06-15T11:50:42Z'

Manually removing this Failed pod with kubectl immediately unlocks the newer pod:

kubectl get pods -n 1password
NAME                                            READY   STATUS       RESTARTS   AGE
onepassword-connect-5dc9cdfb94-j9d5l            2/2     Running      0          167m
onepassword-connect-operator-77cb97d6f6-w7s2x   0/1     Terminated   0          25h
onepassword-connect-operator-77cb97d6f6-8pmlq   1/1     Running      0          3h23m

kubectl delete -n 1password pod onepassword-connect-operator-77cb97d6f6-w7s2x
pod "onepassword-connect-operator-77cb97d6f6-w7s2x" deleted
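
For anyone who wants to spot such leftover pods without reading the table by eye, here is a minimal sketch in Python. It assumes the input is the JSON printed by `kubectl get pods -n 1password -o json`; the function name `terminated_pods` and the name prefix are my own choices for illustration, not part of the operator. Note that kubectl shows the phase Succeeded as "Completed" in the STATUS column.

```python
import json

# Pod phases after which Kubernetes never runs the pod again.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def terminated_pods(pod_list_json, name_prefix="onepassword-connect-operator-"):
    """Return the names of pods with the given prefix that are in a terminal phase.

    pod_list_json is the text printed by `kubectl get pods -n 1password -o json`
    (a List object with an "items" array). The prefix is a hypothetical default
    taken from the pod names in this thread.
    """
    pods = json.loads(pod_list_json).get("items", [])
    return [
        pod["metadata"]["name"]
        for pod in pods
        if pod["metadata"]["name"].startswith(name_prefix)
        and pod.get("status", {}).get("phase") in TERMINAL_PHASES
    ]
```

Each returned name is a candidate for `kubectl delete -n 1password pod <name>`, as done by hand above.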

It's as if the operator scans for onepassword-connect-operator-... pods and assumes that all existing pods are functional/Running, and therefore new pods don't take the lead.
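
To make that guess concrete, here is a rough sketch of the check I would expect before a new pod gives up on the lock. This is not the operator's or operator-sdk's actual code; `lock_is_stale` and the `pod_phases` mapping are hypothetical names for illustration, and treating a missing owner pod as stale is my own assumption.

```python
# Pod phases after which Kubernetes never runs the pod again.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def lock_is_stale(lock_owner, pod_phases):
    """Decide whether the pod holding the leader lock can no longer lead.

    lock_owner is the pod name from the "LockOwner" log field.
    pod_phases maps pod name -> status.phase for pods in the namespace.
    A missing owner pod is treated as stale here (an assumption).
    """
    phase = pod_phases.get(lock_owner)
    return phase is None or phase in TERMINAL_PHASES


# Example using the pods listed above: the Failed pod still owns the lock.
phases = {
    "onepassword-connect-operator-77cb97d6f6-w7s2x": "Failed",
    "onepassword-connect-operator-77cb97d6f6-8pmlq": "Running",
}
stale = lock_is_stale("onepassword-connect-operator-77cb97d6f6-w7s2x", phases)
```

With a check like this, the Running pod would see that the owner is in a terminal phase and could take the lead instead of waiting indefinitely.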

baracoder commented 2 years ago

Sounds similar to this bug, but that was fixed years ago: https://github.com/operator-framework/operator-sdk/pull/2210

edif2008 commented 1 year ago

Hey there, @chrissachs, @mdnfiras and @baracoder. Have you tried updating to the latest version of the operator (1.6.0)? That one uses the latest version of the operator-sdk, which should fix the issue you're facing.

edif2008 commented 10 months ago

Closing this since there were no responses. Feel free to reopen if the issue persists.