argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

ArgoCD deleting an Application on Secret token Update #10167

Open ayan4all opened 2 years ago

ayan4all commented 2 years ago

Hi Friends,

Introduction

We need your expert guidance on an ArgoCD issue that we recently faced in production. We have deployed ArgoCD on an OpenShift 4.8.43 cluster, where ArgoCD runs as containers.

The ArgoCD version is as follows:

argocd: v2.0.0-rc1+0ca643f
BuildDate: 2021-03-19T21:27:59Z
GitCommit: 0ca643f027b99e8a5b931bb8ee9df42c3e4b64bf
GitTreeState: clean
GoVersion: go1.16
Compiler: gc
Platform: linux/amd64

Background about the issue

We are a platform administration team. We deploy all infrastructure and OpenShift cluster changes as code through GitHub, and ArgoCD pushes them to the cluster after the PR is merged. The ArgoCD token expires monthly, so we update it as part of a monthly activity. We have been doing this for a few months without any issues, but on one production cluster the ArgoCD application cpaas-tenants got deleted. This application defines all the customer namespaces and containers, so all OpenShift namespaces on the cluster were deleted, which caused an impact.

We did NOT explicitly issue any delete command in the CLI or select a delete option in the GUI. We are trying to assess how this could have happened and would appreciate your input.

The following are the steps executed to renew the argocd token.

1. We got the "AVP_SECRET_ID" from Vault.
2. We applied that secret ID to the "argocd-vault-plugin-credentials" secret using the OpenShift GUI.
3. After applying "AVP_SECRET_ID" to the secret, we restarted all running ArgoCD deployment pods using the command below.
   oc delete pod <podname> -n <namespace>
   Note: the namespace is the argocd namespace.
4. To verify whether the secret was applied on the OpenShift cluster, we logged into ArgoCD and checked whether the "Sync window" was healthy.

We raised a case with Red Hat, who suggested that ArgoCD has excessive permissions, and we are implementing a change to restrict its delete access. For now, we are trying to determine how the above steps to renew the token could have deleted the ArgoCD application, which in turn caused all the namespaces to be deleted.

Log entries on OpenShift

grpc.request.content="name:\"cpaas-tenants\" cascade:true propagationPolicy:\"foreground\" "
time="2022-07-28T09:55:48Z" level=info msg="bachamn deleted application" application=cpaas-tenants dest-namespace=default dest-server=https://kubernetes.default.svc reason=ResourceDeleted type=Normal

rouke-broersma commented 2 years ago

The logs suggest that one of your users, with account name bachamn, clicked delete in the ArgoCD UI on the application containing all your resources. This smells like user error. Someone is either lying about a mistake they made or has forgotten that they did this.

Also, restarting ArgoCD after refreshing credentials is unnecessary in my experience.
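
To make that reading of the log reproducible, here is a minimal Python sketch that splits the logfmt-style entry from the issue into fields. The helper name `parse_logfmt` is hypothetical, and treating the first word of `msg` as the account name is an assumption based only on this one log line, not on documented ArgoCD behavior.

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt line (key=value pairs, values optionally double-quoted) into a dict."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


# Log entry copied from the issue description.
line = (
    'time="2022-07-28T09:55:48Z" level=info msg="bachamn deleted application" '
    'application=cpaas-tenants dest-namespace=default '
    'dest-server=https://kubernetes.default.svc reason=ResourceDeleted type=Normal'
)

event = parse_logfmt(line)
# Assumption: the message has the form "<account> deleted application".
actor = event["msg"].split(" ", 1)[0]

print(actor, event["application"], event["reason"])
```

This only shows which account the log attributes the delete to. It does not by itself prove how the request was triggered.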

ayan4all commented 2 years ago

@rouke-broersma - Thanks for your reply. We asked the user "bachamn" about this. He said he updated the secret from the GUI and, after restarting the ArgoCD pods, only logged into the ArgoCD GUI to check the sync window.
The pod restart was needed mainly for the "argo-cd-repo-server" pod; otherwise it kept picking up the old secret from its environment variables.

Is there any possibility that a similar log entry could be generated by some other action in ArgoCD, from the CLI or the GUI, such as a "Hard refresh", or by some other issue we might be missing? Please advise.

rouke-broersma commented 2 years ago

A maintainer of argocd would have to answer that question because I don't know, but to me the log message is very clear. It specifies that this user initiated the delete. If it had been initiated by auto sync, I would expect it to say initiated by auto sync or something like that. Now it could be that, if the user initiates a hard refresh and the application does not exist in the git repo, argo classifies the auto prune delete action as initiated by that user. But I do not know if it does, and I would not expect that to be the case.

It also seems unlikely to me because the foreground + cascade grpc call that is also in the log is exactly what happens when a delete action is started in the argocd webui.
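
For anyone who wants to check other log lines for the same pattern, here is a rough Python sketch that pulls the fields out of the `grpc.request.content` entry quoted in the issue and tests for the cascade + foreground combination. The function name `parse_request_content` is hypothetical, and the parsing assumes the logged request is a simple space-separated list of `field:value` pairs, as it is in this one entry.

```python
import shlex


def parse_request_content(log_fragment):
    """Extract field:value pairs from a logged grpc.request.content="..." fragment."""
    # First pass: undo the outer logfmt quoting and escaped inner quotes.
    _, _, value = shlex.split(log_fragment)[0].partition("=")
    # Second pass: split the request text into field:value pairs.
    fields = {}
    for token in shlex.split(value):
        key, sep, val = token.partition(":")
        if sep:
            fields[key] = val
    return fields


# Fragment copied from the issue description (raw string keeps the backslashes).
fragment = r'grpc.request.content="name:\"cpaas-tenants\" cascade:true propagationPolicy:\"foreground\" "'

request = parse_request_content(fragment)
is_cascading_foreground_delete = (
    request.get("cascade") == "true"
    and request.get("propagationPolicy") == "foreground"
)

print(request["name"], is_cascading_foreground_delete)
```

A match only says that the request carried these two settings. Whether other code paths can produce the same request is the question a maintainer would need to answer.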