argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

v2.12: leadership election panics and crashes controller #4761

Closed · alexec closed this 3 years ago

alexec commented 3 years ago
 controller | time="2020-12-16T10:35:34.758Z" level=info msg="Deleting TTL expired workflow argo/calendar-workflow-x28p4"
 controller | E1216 11:15:18.671907   78349 leaderelection.go:307] Failed to release lock: Lease.coordination.k8s.io "workflow-controller" is invalid: spec.leaseDurationSeconds: Invalid value: 0: must be greater than 0

I'm pretty sure we panic here and then we see other things shutting down.

 controller | time="2020-12-16T11:15:18.671Z" level=info msg="stopped leading" id=local
 controller | time="2020-12-16T11:15:18.672Z" level=info msg="Shutting workflow TTL worker"
 controller | panic: http: Server closed
 controller | goroutine 279 [running]:
 controller | github.com/argoproj/argo/workflow/metrics.runServer.func1(0x1, 0xc00074df68, 0x8, 0x2382, 0x0, 0x0, 0xc000836000)
 controller |   /Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:53 +0x117
 controller | created by github.com/argoproj/argo/workflow/metrics.runServer
 controller |   /Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:50 +0x246
 controller | Terminating controller
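
The panic appears to come from the metrics server goroutine treating any error returned by ListenAndServe as fatal, so the graceful shutdown triggered by "stopped leading" takes the whole controller down. A minimal sketch of that failure mode (illustrative only, not the actual workflow/metrics/server.go code; the port and sleeps are made up):

package main

import (
	"errors"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":9090"}
	go func() {
		err := srv.ListenAndServe()
		// A goroutine that panics on every non-nil error crashes the process
		// when the server is closed cleanly, because ListenAndServe then
		// returns http.ErrServerClosed ("http: Server closed").
		// Ignoring http.ErrServerClosed, as below, avoids the crash.
		if err != nil && !errors.Is(err, http.ErrServerClosed) {
			panic(err)
		}
	}()
	time.Sleep(100 * time.Millisecond)
	_ = srv.Close() // simulates the shutdown after losing leadership
	time.Sleep(100 * time.Millisecond)
}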

https://github.com/kubernetes/client-go/issues/754
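
For context, client-go leader election is wired up roughly like this (a minimal sketch, not the actual controller code; the namespace, lease name, and timings are assumptions). The linked bug is that, when the lock is released on shutdown, client-go writes back a record with leaseDurationSeconds: 0, which the API server rejects for Lease objects, producing exactly the "Failed to release lock" error above:

package main

import (
	"context"
	"log"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A unique identity per replica; in v2.12 this is what
	// LEADER_ELECTION_IDENTITY provides (typically the pod name).
	id := os.Getenv("LEADER_ELECTION_IDENTITY")

	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"argo", "workflow-controller",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		log.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,             // releasing on shutdown is what hits the zero-duration write
		LeaseDuration:   15 * time.Second, // must be > 0
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     5 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start workers */ },
			OnStoppedLeading: func() { log.Println("stopped leading") },
		},
	})
}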

sarabala1979 commented 3 years ago

@alexec is this happening in your test? Are you configuring LeaseDuration: 0 * time.Second in code?

alexec commented 3 years ago

@sarabala1979 no, it just seems to happen after running for (maybe) 20m+.

sarabala1979 commented 3 years ago

The issue is fixed in v1.20 Kubernetes. https://github.com/kubernetes/kubernetes/pull/80954

sarabala1979 commented 3 years ago

It is very hard to reproduce this issue; it is not consistent. I was able to reproduce it twice in my local env with k3d. I then got one additional, different error, after which I couldn't reproduce it.

controller | I1217 09:33:17.702074   80579 leaderelection.go:288] failed to renew lease argo/workflow-controller: failed to tryAcquireOrRenew context deadline exceeded
controller | E1217 09:33:20.362137   80579 leaderelection.go:307] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "workflow-controller": the object has been modified; please apply your changes to the latest version and try again
max-sixty commented 3 years ago

FYI, it looks like the fix has been cherry-picked back to 1.18: https://github.com/kubernetes/kubernetes/pull/80954#issuecomment-717382622

max-sixty commented 3 years ago

We've hit a similar issue, around "LEADER_ELECTION_IDENTITY must be set so that the workflow controllers can elect a leader", which caused our workflow-controller to go into CrashLoopBackOff and jobs not to be scheduled.

Here are the logs, if that's helpful. If there's a way we could have prevented this, it would be great to know. And if anyone has a monitoring approach that would have alerted on something like this, I'd be very interested to learn. Thanks in advance:

time="2020-12-25T20:44:53.729Z" level=info msg="config map" name=workflow-controller-configmap
time="2020-12-25T20:44:53.745Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: true\n  gcs:\n    bucket: {}\n    serviceAccountKeySecret:\n      key: \"\"\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\nparallelism: 20\npodSpecLogStrategy: {}\ntelemetryConfig: {}\nworkflowDefaults:\n  metadata:\n    creationTimestamp: null\n  spec:\n    arguments: {}\n    retryStrategy:\n      backoff:\n        duration: 1m\n        factor: 2\n      limit: 10\n      retryPolicy: Always\n    serviceAccountName: {}\n    ttlStrategy:\n      secondsAfterFailure: 604800\n      secondsAfterSuccess: 604800\n  status:\n    finishedAt: null\n    startedAt: null\n"
time="2020-12-25T20:44:53.745Z" level=info msg="Persistence configuration disabled"
time="2020-12-25T20:44:53.746Z" level=info msg="Starting Workflow Controller" version=v2.11.8+b7412aa.dirty
time="2020-12-25T20:44:53.746Z" level=info msg="Workers: workflow: 32, pod: 32, pod cleanup: 4"
time="2020-12-25T20:44:53.867Z" level=info msg="Manager initialized successfully"
time="2020-12-25T20:44:53.867Z" level=fatal msg="LEADER_ELECTION_IDENTITY must be set so that the workflow controllers can elect a leader"

This sequence repeats every five minutes as the workflow-controller container restarts.

alexec commented 3 years ago

You need to set the LEADER_ELECTION_IDENTITY environment variable in your manifest. This is typically (always?) set to metadata.name.

sarabala1979 commented 3 years ago

Are you using install.sh from the release or just updating the release version? If you are just updating the release version, you need to add the env var to your existing deployment spec:

        env:
        - name: LEADER_ELECTION_IDENTITY
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
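(This goes under the workflow-controller container's env in the Deployment spec; the downward API fieldRef sets the identity to the pod name, so each replica gets a unique identity.)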
max-sixty commented 3 years ago

I was using 2.11.8, tagged with kustomize: https://github.com/argoproj/argo/manifests/cluster-install?ref=v2.11.8. It had worked for a few weeks before this issue. I'm fairly confident the manifests didn't change.

I don't see any LEADER_ELECTION_IDENTITY in the manifests until after 2.12.2. I checked the manifests that were deployed and the images were all correctly tagged to 2.11.8 — it doesn't seem to be an issue with an untagged image upgrading accidentally.

So either I made a mistake (very possible) or the error came from a 2.11.8 image.

It resolved after upgrading to 2.12.2; I'm not sure whether that was the upgrade itself or simply the change of version. Wiping the whole argo namespace and reapplying the 2.11.8 manifests didn't help.

Thank you as ever for engaging and for the phenomenal library.

alexec commented 3 years ago

LEADER_ELECTION_IDENTITY is a v2.12 feature, not a v2.11 feature, so you should not see anything about it in the logs if you're running v2.11.8.

docker run argoproj/workflow-controller:v2.11.8 version
Unable to find image 'argoproj/workflow-controller:v2.11.8' locally
v2.11.8: Pulling from argoproj/workflow-controller
e54ef591d839: Already exists 
d69ba838510e: Already exists 
Digest: sha256:44d87c9f555fc14ef2433eeda4f29d70eab37b6bda7e019192659c95e5ed0161
Status: Downloaded newer image for argoproj/workflow-controller:v2.11.8
workflow-controller: v2.11.8+b7412aa.dirty
  BuildDate: 2020-12-24T10:16:20Z
  GitCommit: b7412aa1bcff2df20bbe5d515abddb8f33cf4c9e
  GitTreeState: dirty
  GitTag: v2.10.0-rc1
  GoVersion: go1.13.15
  Compiler: gc
  Platform: linux/amd64

It looks like someone (me?) has overwritten the v2.11.8 controller with a test version.

alexec commented 3 years ago

Why oh why does Docker Hub allow you to overwrite images like this? It is impossible to prevent this from ever happening; even updating the build scripts to check that a version does not already exist could not prevent it.

alexec commented 3 years ago

@max-sixty Fixed.

docker run argoproj/workflow-controller:v2.11.8 version
workflow-controller: v2.11.8
  BuildDate: 2020-12-29T20:43:36Z
  GitCommit: 310e099f82520030246a7c9d66f3efaadac9ade2
  GitTreeState: clean
  GitTag: v2.11.8
  GoVersion: go1.13.4
  Compiler: gc
  Platform: linux/amd64
max-sixty commented 3 years ago

Great — thanks a lot for tracking it down!

sarabala1979 commented 3 years ago

The K8s 1.19 client has a fix for this issue.