argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Argo CD automated pruning is pruning a resource that still exists in the GitHub Repository #14090


qoehliang commented 1 year ago

Describe the bug

Whenever we merge a pull request to the master branch, Argo CD on one of our many EKS clusters prunes an object that was not touched by that change (it is still present in the GitHub repository).

To Reproduce

  1. Deploy an Application with automated pruning enabled, for example:

     project: core-components
     source:
       repoURL: >-
         https://.........<removed for security reasons>
       path: helmfile/rendered/non-prod-v1.23-monitor/cluster-autoscaler
       targetRevision: HEAD
       plugin:
         name: envsubst
     destination:
       server: 'https://kubernetes.default.svc'
       namespace: kube-system
     syncPolicy:
       automated:
         prune: true
         selfHeal: true
       retry:
         limit: 3
         backoff:
           duration: 5s
           factor: 2
           maxDuration: 3m
  2. Merge a new commit to the master branch.

  3. The issue occurs in 1 or 2 of the 40 EKS clusters we manage. For context, we use Argo CD to deploy a set of core services, since we manage a platform for the company; e.g. we deploy the open-source Cluster Autoscaler to all 40 EKS clusters using Argo CD.

This happens almost every time we push to the master branch of the repository that holds all of the Kubernetes manifest files for our services.

Expected behavior

Argo CD should perform an automated sync any time we merge to master (that is currently happening as expected), but it should detect no changes for unaffected applications and therefore not prune the resource.
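
As a stopgap while this is investigated, individual resources can be excluded from automated pruning with a per-resource sync option. A minimal sketch, assuming we annotate the rendered ServiceAccount manifest in Git:

    # Sketch: this annotation tells Argo CD never to prune this object,
    # even when the Application has automated pruning enabled.
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: cluster-autoscaler
      namespace: kube-system
      annotations:
        argocd.argoproj.io/sync-options: Prune=false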

Screenshots

We can see that an automated sync succeeds and prunes the Cluster Autoscaler ServiceAccount, but nothing else. (screenshot)

Clicking into the revision, we can see the merged PR that triggered the automated sync: (screenshot)

In the PR you can see that only 4 files changed, all for a dev cluster and for the alertmanager-extras application, which has nothing to do with Cluster Autoscaler.

The Cluster Autoscaler that got pruned was in a non-production environment and uses the path mentioned in the Application manifest above: helmfile/rendered/non-prod-v1.23-monitor/cluster-autoscaler. Checking that path, you can see the last change was 23 days ago, not today, which is when the issue happened. (screenshot)

Performing another sync does not re-create the object; it feels as though the cache has become invalid. If I then do a hard refresh, the issue goes away until the next merge to the GitHub repository. (screenshot)

Performing a regular Refresh also doesn't do anything.

Performing a hard refresh does bring the object back: you can see the ServiceAccount is a few seconds old compared to the Service, which is 1 year old. I read that a hard refresh invalidates the cache, so this goes back to my concern that, somehow, after a Git commit to master, Argo CD is losing track of which objects should be deployed for which application. I don't suspect connectivity problems, because I would have expected the whole Application to be marked as missing, not just one object in an Application.
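
For reference, the hard refresh I trigger from the UI can also be requested declaratively by annotating the Application; a minimal sketch (the Application name is taken from the logs below, the rest is boilerplate):

    # Sketch: the refresh annotation asks the application controller for a hard
    # refresh (manifest cache invalidation); the controller removes the
    # annotation once the refresh has been processed.
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: cluster-autoscaler
      namespace: argocd
      annotations:
        argocd.argoproj.io/refresh: hard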


Version

We are using Argo CD 2.5.5 and planning to upgrade to 2.6.6 shortly. (screenshot)

% argocd version
argocd: v2.6.4+7be094f.dirty
  BuildDate: 2023-03-07T23:43:59Z
  GitCommit: 7be094f38d06859b594b98eb75c7c70d39b80b1e
  GitTreeState: dirty
  GoVersion: go1.20.2
  Compiler: gc
  Platform: darwin/arm64

Logs

# Can see the dry-run detects the ServiceAccount as obj->nil
time="2023-06-15T23:23:45Z" level=info msg="Tasks (dry-run)" application=argocd/cluster-autoscaler syncId=00006-fjTPP tasks="[Sync/0 resource policy/PodDisruptionBudget:kube-system/cluster-autoscaler obj->obj (,,), Sync/0 resource /ServiceAccount:kube-system/cluster-autoscaler obj->nil (,,), Sync/0 resource rbac.authorization.k8s.io/ClusterRole:kube-system/cluster-autoscaler obj->obj (,,), Sync/0 resource rbac.authorization.k8s.io/ClusterRoleBinding:kube-system/cluster-autoscaler obj->obj (,,), Sync/0 resource rbac.authorization.k8s.io/Role:kube-system/cluster-autoscaler obj->obj (,,), Sync/0 resource rbac.authorization.k8s.io/RoleBinding:kube-system/cluster-autoscaler obj->obj (,,), Sync/0 resource /Service:kube-system/cluster-autoscaler obj->obj (,,), Sync/0 resource apps/Deployment:kube-system/cluster-autoscaler obj->obj (,,), Sync/0 resource monitoring.coreos.com/ServiceMonitor:monitoring/cluster-autoscaler obj->obj (,,)]"

# Then it subsequently prunes ServiceAccount
time="2023-06-15T23:23:45Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/cluster-autoscaler kind=ServiceAccount name=cluster-autoscaler namespace=kube-system phase=Sync syncId=00006-fjTPP

ca.log

YoranSys commented 1 year ago

@qoehliang, perhaps you have encountered the same issue as us. Could you please check all the commit IDs in your history?

On our end, we have 80 applications generated by an ApplicationSet, and sometimes one or more of them use a previous commit ID from about three weeks ago. We are uncertain whether this issue will be resolved by https://github.com/argoproj/argo-cd/pull/13452.

crenshaw-dev commented 1 year ago

@YoranSys that PR fixes a bug that only currently exists in the master branch, not on any released version. What version are you running?

YoranSys commented 1 year ago

Hello @crenshaw-dev, we use v2.7.3+e7891b8.dirty, but we started noticing this problem on v2.6.4.

crenshaw-dev commented 1 year ago

Do you use a CMP, like OP? It's possible there's a cache issue specific to CMPs.

YoranSys commented 1 year ago

@crenshaw-dev, I'm currently using an ApplicationSet with the rollingSync option and Helm integration, where I store some values in S3 and others inside Git. I use a GitHub repository with the master branch as the generator (I have never seen the issue on the generator side). Despite clearing the Redis cache multiple times, the problem continues to occur.
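
For context, the rollingSync setup is roughly shaped like the sketch below; everything in it (names, labels, repo URL) is a placeholder rather than our real configuration:

    # Sketch of an ApplicationSet using the progressive sync (RollingSync) strategy.
    apiVersion: argoproj.io/v1alpha1
    kind: ApplicationSet
    metadata:
      name: example-apps                 # placeholder name
      namespace: argocd
    spec:
      generators:
        - list:
            elements:                    # placeholder environments
              - cluster: dev-1
                env: dev
              - cluster: prod-1
                env: prod
      strategy:
        type: RollingSync
        rollingSync:
          steps:                         # dev Applications sync before prod ones
            - matchExpressions:
                - key: env
                  operator: In
                  values: [dev]
            - matchExpressions:
                - key: env
                  operator: In
                  values: [prod]
      template:
        metadata:
          name: 'example-{{cluster}}'
          labels:
            env: '{{env}}'               # label matched by the rollingSync steps
        spec:
          project: default
          source:
            repoURL: https://github.com/example/config.git   # placeholder repo
            targetRevision: master
            path: 'clusters/{{cluster}}'
          destination:
            server: https://kubernetes.default.svc
            namespace: example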

qoehliang commented 1 year ago

@YoranSys, thanks for sharing your observations. I haven't seen an application using an old commit, and the behaviour we have observed is not the state of an application going back several commits, but rather a ServiceAccount or ClusterRole getting pruned and then only coming back after a manual "hard refresh" or a restart of the Argo CD pods.

We have 40 or so applications configured under one GitHub repository, but we do not use ApplicationSets; we follow an app-of-apps approach. With that said, as @crenshaw-dev mentioned, we are using a Config Management Plugin (envsubst) via the ConfigMap plugin approach. We're not sure if that is where the culprit lies, as we have noticed that ConfigMap plugins are being deprecated.

The weird thing is that we have other GitHub repositories which are not impacted by this issue. Another GitHub repository with only 10 or so applications deployed to the same EKS cluster has been running perfectly, so the scalability/size of the repository may also be playing a part. We are in the process of disabling automated pruning because of how inconsistent and unreliable the feature has been in our clusters.
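
For completeness, the envsubst plugin is registered in argocd-cm in the deprecated ConfigMap style, roughly like the sketch below; the generate command shown is an assumption to illustrate the shape of the configuration, not our exact command:

    # Sketch of the deprecated ConfigMap-style plugin registration in argocd-cm.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: argocd-cm
      namespace: argocd
    data:
      configManagementPlugins: |
        - name: envsubst
          generate:
            command: ["sh", "-c"]
            args: ["cat *.yaml | envsubst"]   # assumed command, for illustration only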

crenshaw-dev commented 1 year ago

> and Helm integration, where I store some values in S3 and others inside Git

@YoranSys is that a CMP? AFAIK, Argo CD's built-in Helm support offers no way to communicate with S3.

> via the ConfigMap plugin approach. We're not sure if that is where the culprit lies, as we have noticed that ConfigMap plugins are being deprecated.

@qoehliang they're deprecated, but they should still work fine. I suspect that some error in the CMP is causing it to return an empty manifest set while still exiting zero. So Argo CD is like "word, there are no resources, prune 'em all!"

Are you observing that all resources in the app are pruned, or just some? Because if it's just some, my theory is wrong.
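
If the empty-output theory holds, one defensive option (just a sketch, not a built-in Argo CD feature) is to make the plugin's generate step exit non-zero when it produces no output, so a flaky CMP surfaces as a sync error instead of an empty manifest set:

    # Sketch: fail the generate step when the rendered output is empty, so the
    # sync errors out rather than treating the app as having zero resources.
    - name: envsubst
      generate:
        command: ["sh", "-c"]
        args:
          - |
            out="$(cat *.yaml | envsubst)"        # assumed render command
            if [ -z "$out" ]; then
              echo "envsubst plugin produced no manifests" >&2
              exit 1
            fi
            printf '%s\n' "$out"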

dayyeung commented 1 year ago

We observe the same issue. We do not have auto-prune enabled, but we consistently see some apps having all of their resources marked as to-be-pruned.

CryptoTr4der commented 8 months ago

Hi! We are experiencing a similar issue with Argo CD v2.9.3. We've observed that it randomly prunes several applications (all of their resources), causing their states to change to 'Missing', and then attempts to redeploy each resource again after a certain period. As a workaround, we have disabled pruning across the board. Additionally, it's worth noting that we are using the CMP plugin argocd-vault-plugin (custom sidecar image).
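
For anyone comparing setups, a sidecar CMP for argocd-vault-plugin is typically defined by a ConfigManagementPlugin manifest (plugin.yaml) mounted into the sidecar container; the sketch below shows the general shape, with the command and discovery rule as assumptions rather than our exact config:

    # Sketch of a sidecar CMP definition, mounted at
    # /home/argocd/cmp-server/config/plugin.yaml in the plugin sidecar.
    apiVersion: argoproj.io/v1alpha1
    kind: ConfigManagementPlugin
    metadata:
      name: argocd-vault-plugin
    spec:
      generate:
        command: ["sh", "-c"]
        args: ["argocd-vault-plugin generate ./"]   # assumed command
      discover:
        find:
          glob: "**/*.yaml"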

gusfcarvalho commented 7 months ago

I see a similar issue here on Argo CD v2.8.4. The pruning here seems to happen only on cluster-scoped resources, essentially because Argo CD queries them as being namespaced instead of cluster-scoped. This causes Argo CD to prune the 'namespaced' version and then re-create the resource as a cluster-scoped version. I'm not sure my particular issue is actually an Argo CD bug, but I just wanted to add to the thread as it may help others.