@alexec is this happening in your test? Are you configuring LeaseDuration: 0 * time.Second in code?
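For reference, LeaseDuration here refers to the client-go leader-election settings the controller runs with. Below is a minimal, illustrative sketch of that wiring (not Argo's actual code; the argo namespace and workflow-controller lease name are taken from the log further down, and the durations are just example values):

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// In-cluster client; the lease object lives in coordination.k8s.io, which is
	// the leases.coordination.k8s.io resource named in the error below.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// The identity must be unique per replica; the pod hostname works for a sketch.
	id, err := os.Hostname()
	if err != nil {
		panic(err)
	}

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Namespace: "argo", Name: "workflow-controller"},
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		// A zero LeaseDuration is rejected by client-go's config validation; renewals
		// that can't complete inside RenewDeadline surface as the
		// "failed to tryAcquireOrRenew context deadline exceeded" error quoted below.
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* run the controller */ },
			OnStoppedLeading: func() { /* stop work; usually the process exits */ },
		},
	})
}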
@sarabala1979 no, it just seems to happen after running for about 20m+.
The issue is fixed in v1.20 Kubernetes. https://github.com/kubernetes/kubernetes/pull/80954
It is very hard to reproduce this issue; it is not consistent. I was able to reproduce it twice in my local env with k3d. I also got one more, different error, and after that I couldn't reproduce it.
controller | I1217 09:33:17.702074 80579 leaderelection.go:288] failed to renew lease argo/workflow-controller: failed to tryAcquireOrRenew context deadline exceeded
controller | E1217 09:33:20.362137 80579 leaderelection.go:307] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "workflow-controller": the object has been modified; please apply your changes to the latest version and try again
FYI It looks like the fix is cherry-picked back to 1.18: https://github.com/kubernetes/kubernetes/pull/80954#issuecomment-717382622
We've hit a similar issue around "LEADER_ELECTION_IDENTITY must be set so that the workflow controllers can elect a leader", which caused our workflow-controller to go into CrashLoopBackOff and jobs not to be scheduled.
Here are the logs, if that's helpful. If there's a way we could have prevented this, it would be great to know. And if anyone has a monitoring approach that would have alerted on something like this, I'd be very interested to learn. Thanks in advance:
time="2020-12-25T20:44:53.729Z" level=info msg="config map" name=workflow-controller-configmap
time="2020-12-25T20:44:53.745Z" level=info msg="Configuration:\nartifactRepository:\n archiveLogs: true\n gcs:\n bucket: {}\n serviceAccountKeySecret:\n key: \"\"\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\nparallelism: 20\npodSpecLogStrategy: {}\ntelemetryConfig: {}\nworkflowDefaults:\n metadata:\n creationTimestamp: null\n spec:\n arguments: {}\n retryStrategy:\n backoff:\n duration: 1m\n factor: 2\n limit: 10\n retryPolicy: Always\n serviceAccountName: {}\n ttlStrategy:\n secondsAfterFailure: 604800\n secondsAfterSuccess: 604800\n status:\n finishedAt: null\n startedAt: null\n"
time="2020-12-25T20:44:53.745Z" level=info msg="Persistence configuration disabled"
time="2020-12-25T20:44:53.746Z" level=info msg="Starting Workflow Controller" version=v2.11.8+b7412aa.dirty
time="2020-12-25T20:44:53.746Z" level=info msg="Workers: workflow: 32, pod: 32, pod cleanup: 4"
time="2020-12-25T20:44:53.867Z" level=info msg="Manager initialized successfully"
time="2020-12-25T20:44:53.867Z" level=fatal msg="LEADER_ELECTION_IDENTITY must be set so that the workflow controllers can elect a leader"
This sequence repeats every five minutes as the workflow-controller container restarts.
You need to set the LEADER_ELECTION_IDENTITY environment variable in your manifest. This is typically (always?) set to metadata.name.
Are you using install.sh from the release, or just updating the release version? If you are just updating the release version, you need to add the env to your existing deployment spec, as shown below:
env:
  - name: LEADER_ELECTION_IDENTITY
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
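For context on why metadata.name is the usual value: it becomes this replica's identity in the client-go lease lock, so it needs to be unique per pod, and the Downward API pod name gives that for free. A minimal sketch of the consuming side, assuming client-go's resourcelock package (illustrative, not Argo's actual code):

package controller

import (
	"log"
	"os"

	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// leaderIdentity builds the lock identity for this replica. With the Downward API
// snippet above, the value is the pod name, so each controller replica records a
// distinct holder in the lease.
func leaderIdentity() resourcelock.ResourceLockConfig {
	id := os.Getenv("LEADER_ELECTION_IDENTITY")
	if id == "" {
		// This is the condition behind the fatal message quoted in the logs above.
		log.Fatal("LEADER_ELECTION_IDENTITY must be set so that the workflow controllers can elect a leader")
	}
	return resourcelock.ResourceLockConfig{Identity: id}
}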
I was using 2.11.8, tagged with kustomize: https://github.com/argoproj/argo/manifests/cluster-install?ref=v2.11.8. It had worked for a few weeks before this issue. I'm fairly confident the manifests didn't change.
I don't see any LEADER_ELECTION_IDENTITY in the manifests until after 2.12.2. I checked the manifests that were deployed and the images were all correctly tagged to 2.11.8; it doesn't seem to be an issue with an untagged image upgrading accidentally. So either I made a mistake (very possible) or the error came from a 2.11.8 image.
It resolved after upgrading to 2.12.2; I'm not sure whether that was the new version itself or just the act of changing versions. Wiping the whole argo namespace and reapplying the 2.11.8 manifests didn't help.
Thank you as ever for engaging and for the phenomenal library.
LEADER_ELECTION_IDENTITY is a v2.12 feature, not a v2.11 feature, so you should not see anything in the logs if you're running v2.11.8.
docker run argoproj/workflow-controller:v2.11.8 version
Unable to find image 'argoproj/workflow-controller:v2.11.8' locally
v2.11.8: Pulling from argoproj/workflow-controller
e54ef591d839: Already exists
d69ba838510e: Already exists
Digest: sha256:44d87c9f555fc14ef2433eeda4f29d70eab37b6bda7e019192659c95e5ed0161
Status: Downloaded newer image for argoproj/workflow-controller:v2.11.8
workflow-controller: v2.11.8+b7412aa.dirty
BuildDate: 2020-12-24T10:16:20Z
GitCommit: b7412aa1bcff2df20bbe5d515abddb8f33cf4c9e
GitTreeState: dirty
GitTag: v2.10.0-rc1
GoVersion: go1.13.15
Compiler: gc
Platform: linux/amd64
It looks like someone (me?) has overwritten the v2.11.8 controller with a test version.
Why oh why does Docker Hub allow you to overwrite images like this? It is impossible to prevent this from ever happening; even updating build scripts to check that a version does not already exist could not prevent it.
@max-sixty Fixed.
docker run argoproj/workflow-controller:v2.11.8 version
workflow-controller: v2.11.8
BuildDate: 2020-12-29T20:43:36Z
GitCommit: 310e099f82520030246a7c9d66f3efaadac9ade2
GitTreeState: clean
GitTag: v2.11.8
GoVersion: go1.13.4
Compiler: gc
Platform: linux/amd64
Great — thanks a lot for tracking it down!
K8s 1.19 client has a fix for this issue
I'm pretty sure we panic here and then we see other things shutting down.
https://github.com/kubernetes/client-go/issues/754