argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Cannot update self-managed ArgoCD to 2.12 due to race condition between argocd-redis and argocd-application-controller #19798

Open akloss-cibo opened 2 months ago

akloss-cibo commented 2 months ago

Describe the bug

When updating a self-managed ArgoCD installation from 2.10.9 to 2.12.3 using the Kustomization at https://github.com/argoproj/argo-cd/manifests/cluster-install?ref=v2.12.3, the upgrade gets stuck. In several attempts, the argocd-application-controller StatefulSet was updated before the argocd-redis Deployment. The new argocd-application-controller pod won't start because the argocd-redis Secret hasn't yet been populated by the init container in the argocd-redis Deployment, and the argocd-redis Deployment never gets updated because there is no longer a running argocd-application-controller pod to sync it.

To Reproduce

Steps to reproduce:

  1. Create an ArgoCD Application to install ArgoCD from the manifest at github.com/argoproj/argo-cd/manifests/cluster-install?ref=v2.10.9
  2. Update the Application to target github.com/argoproj/argo-cd/manifests/cluster-install?ref=v2.12.3
  3. Observe that the new argocd-application-controller-0 pod won't start because the argocd-redis Secret doesn't exist, and the argocd-redis Deployment stays out-of-sync because there is no argocd-application-controller pod running to sync it.
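
For illustration, the Application in steps 1 and 2 might look roughly like the sketch below. The actual manifest used is not shown in this issue, so the Application name, namespace, project, and destination are assumptions; only the repository path and the two version tags come from the steps above.

```yaml
# Sketch only: name, namespace, project, and destination are assumed values.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd            # assumed name
  namespace: argocd       # assumed namespace
spec:
  project: default        # assumed project
  source:
    repoURL: https://github.com/argoproj/argo-cd
    path: manifests/cluster-install
    targetRevision: v2.10.9   # step 2 changes this to v2.12.3
  destination:
    server: https://kubernetes.default.svc   # assumed in-cluster destination
    namespace: argocd                         # assumed namespace
```

Changing `targetRevision` from v2.10.9 to v2.12.3 and syncing corresponds to step 2.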

Expected behavior

ArgoCD should apply updates from the Kustomization in an order that ensures the argocd-redis Secret exists before the argocd-application-controller is updated to depend on it.

I've addressed this by adding sync-wave annotations in a kustomization overlay, but adding them upstream would make this work for everyone without having to patch the redis resources:

diff --git a/manifests/base/redis/argocd-redis-deployment.yaml b/manifests/base/redis/argocd-redis-deployment.yaml
index c591db0d0..9861b5656 100644
--- a/manifests/base/redis/argocd-redis-deployment.yaml
+++ b/manifests/base/redis/argocd-redis-deployment.yaml
@@ -1,6 +1,8 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
+  annotations:
+    argocd.argoproj.io/sync-wave: "-1"
   labels:
     app.kubernetes.io/name: argocd-redis
     app.kubernetes.io/part-of: argocd
diff --git a/manifests/base/redis/argocd-redis-network-policy.yaml b/manifests/base/redis/argocd-redis-network-policy.yaml
index 145487474..bdb4ae9b8 100644
--- a/manifests/base/redis/argocd-redis-network-policy.yaml
+++ b/manifests/base/redis/argocd-redis-network-policy.yaml
@@ -1,6 +1,8 @@
 kind: NetworkPolicy
 apiVersion: networking.k8s.io/v1
 metadata:
+  annotations:
+    argocd.argoproj.io/sync-wave: "-1"
   name: argocd-redis-network-policy
 spec:
   podSelector:
diff --git a/manifests/base/redis/argocd-redis-role.yaml b/manifests/base/redis/argocd-redis-role.yaml
index a7a33f48a..c19c4356a 100644
--- a/manifests/base/redis/argocd-redis-role.yaml
+++ b/manifests/base/redis/argocd-redis-role.yaml
@@ -1,6 +1,8 @@
 apiVersion: rbac.authorization.k8s.io/v1
 kind: Role
 metadata:
+  annotations:
+    argocd.argoproj.io/sync-wave: "-1"
   labels:
     app.kubernetes.io/component: redis
     app.kubernetes.io/name: argocd-redis
diff --git a/manifests/base/redis/argocd-redis-rolebinding.yaml b/manifests/base/redis/argocd-redis-rolebinding.yaml
index f396914df..68a84cfe6 100644
--- a/manifests/base/redis/argocd-redis-rolebinding.yaml
+++ b/manifests/base/redis/argocd-redis-rolebinding.yaml
@@ -1,6 +1,8 @@
 apiVersion: rbac.authorization.k8s.io/v1
 kind: RoleBinding
 metadata:
+  annotations:
+    argocd.argoproj.io/sync-wave: "-1"
   labels:
     app.kubernetes.io/component: redis
     app.kubernetes.io/name: argocd-redis
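
Until something like the diff above lands upstream, the same annotations can be applied from a local overlay. The following is a minimal sketch of such an overlay, not the exact one used here: it assumes a `kustomization.yaml` that pulls in the upstream cluster-install manifests and adds the annotation with strategic merge patches. Only the Deployment and NetworkPolicy are shown, since their names appear in this issue; the Role and RoleBinding would need the same kind of patch with their own names.

```yaml
# Sketch of a local overlay (assumed layout), adding sync-wave "-1" to redis resources.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - github.com/argoproj/argo-cd/manifests/cluster-install?ref=v2.12.3
patches:
  # Sync the redis Deployment before the wave-0 resources.
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: argocd-redis
        annotations:
          argocd.argoproj.io/sync-wave: "-1"
  # Same annotation for the redis NetworkPolicy.
  - patch: |-
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: argocd-redis-network-policy
        annotations:
          argocd.argoproj.io/sync-wave: "-1"
```

Resources without a sync-wave annotation default to wave 0, so anything annotated "-1" is applied first.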

Version

Upgrading from:

% k exec argocd-application-controller-0 -- /usr/local/bin/argocd version
time="2024-09-05T12:47:14Z" level=fatal msg="Argo CD server address unspecified"
argocd: v2.10.9+c071af8
  BuildDate: 2024-04-30T15:53:28Z
  GitCommit: c071af808170bfc39cbdf6b9be4d0212dd66db0c
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/amd64
command terminated with exit code 1
%

Upgrading to:

% k exec argocd-application-controller-0 -- /usr/local/bin/argocd version
time="2024-09-05T12:47:41Z" level=fatal msg="Argo CD server address unspecified"
argocd: v2.12.3+6b9cd82
  BuildDate: 2024-08-27T11:57:48Z
  GitCommit: 6b9cd828c6e9807398869ad5ac44efd2c28422d6
  GitTreeState: clean
  GoVersion: go1.22.4
  Compiler: gc
  Platform: linux/amd64
command terminated with exit code 1
%

andrii-korotkov-verkada commented 2 weeks ago

I don't see the sync waves in master anymore, so maybe try getting new manifests and/or upgrading to v2.13.0.

akloss-cibo commented 2 weeks ago

I think there's a misunderstanding. There are no sync-waves in upstream; I have added sync-wave annotations to our local kustomization to make this work for us.

andrii-korotkov-verkada commented 2 weeks ago

@akloss-cibo, okay, wasn't sure if that was the case. Thanks for confirming. Can you define sync waves to get the proper order? If not, what's blocking?

akloss-cibo commented 2 weeks ago

Yes, the sync-wave changes in my original post work for us: they cause redis to be installed before the argocd-application-controller StatefulSet.

akloss-cibo commented 2 weeks ago

I'm confused about why this was closed. Yes, we can mitigate the problem by applying our own sync-waves, but it seems to me that the provided kustomization should work for this use case out-of-the-box.

andrii-korotkov-verkada commented 2 weeks ago

Re-opening. Are you suggesting that we just add sync waves to the upstream manifests for people who manage argocd with argocd?

akloss-cibo commented 2 weeks ago

I do, yes. I think the folks who engineered the password-enabled redis should probably be consulted to see if they have some other strategy they'd like to see used.

mtang-pton commented 2 days ago

Just wanted to add that I've run into this issue as well while upgrading self-managed Argo from 2.10 -> 2.12. In our case, the upgrade went through in stage but failed in prod, leaving our prod Argo in a bad state while we attempted to debug the issue. I ended up having to disable redis auth to get argocd-application-controller working before enabling redis auth again.

I'd like to see the sync waves added upstream and the fix released in 2.12/2.13.

crenshaw-dev commented 2 days ago

For folks who hit this issue: how did you upgrade Argo CD? It doesn't make sense to me that OP's argocd-redis was out-of-sync while argocd-application-controller was in sync (on the newer version). Did you do a partial sync that didn't include argocd-redis?