argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
16.76k stars 5.08k forks source link

Argo CD 2.5.10 Incorrectly Pruned #15058

Open rumstead opened 10 months ago

rumstead commented 10 months ago

Checklist:

Describe the bug

We saw Argo CD mass prune an entire application incorrectly. Sadly, these resources caused a rather large cascading effect, deleting the majority of our Kubernetes deployments.

After a hard refresh, Argo CD correctly identified that the manifests it pruned, shouldn't have been pruned. Additionally, a git ls-files confirms the files that were pruned are on the commit.

The log message below is very similar to this issue #2573 and I have a feeling it's semi-related. We are planning on upgrading to 2.8 soon but wanted to see if anyone knows of an issue or fix that was put in v2.5.10+ already.

To Reproduce Been trying :-/

Expected behavior Manifests in Git shouldn't be pruned

Screenshots Attached are two screenshots showing the prune and then Argo CD saying they are out of sync after a hard sync.

Version

argocd-server: v2.5.10+d311fad
  BuildDate: 2023-02-02T15:10:21Z
  GitCommit: d311fad538178ef97528edf7f09d1d5b57268f45
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:48Z
  Helm Version: v3.10.3+g835b733
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.18.0

Logs

time="2023-08-12T17:25:47Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://...,Path:apps/,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:&ApplicationSourceDirectory{Recurse:true,Jsonnet:ApplicationSourceJsonnet{ExtVars:[]JsonnetVar{},TLAs:[]JsonnetVar{},Libs:[],},Exclude:,Include:,},Plugin:nil,Chart:,}/9b9ff97b11d83373bdd4d05fadcf20c791e5600e"

time="2023-08-12T17:26:25Z" level=info msg=Syncing application=namespace/tenants skipHooks=false started=false syncId=00479-rtgeN

time="2023-08-12T17:26:25Z" level=info msg="Updating operation state. phase: Running -> Running, message: '' -> 'one or more tasks are running'" application=namespace/tenants syncId=00479-rtgeN

2023-08-12T17:26:25 ons-service obj->nil (,,), Sync/0 resource internal.com/Tenant:/app-ui obj->nil (,,), Sync/0 resource internal.com/Tenant:/preview-portal obj->nil (,,), Sync/0 resource internal.com/Tenant:/printing-press obj->nil (,,), Sync/0 resource internal.com/Tenant:/product-service-tenant obj->nil (,,), Sync/0 resource internal.com/Tenant:/provider-admin-api-tenant obj->nil (,,), Sync/0 resource internal.com/Tenant:/provider-call-or-distribution-validation-service-tenant obj->nil (,,), Sync/0 resource internal.com/Tenant:/provider-email-service-tenant obj->nil (,,)...]"

time="2023-08-12T17:26:25Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=namespace/tenants kind=Tenant name=provider-email-service-tenant namespace= phase=Sync syncId=00479-rtgeN

... many many more pruned

time="2023-08-12T17:26:26Z" level=info msg="sync/terminate complete" application=namespace/tenants duration=1.014547622s syncId=00479-rtgeN

time="2023-08-12T17:26:26Z" level=info msg="updated 'namespace/tenants' operation (phase: Succeeded)"

time="2023-08-12T17:25:52Z" level=info msg="Reconciliation completed" application=namespace/tenants dedup_ms=3 dest-
name= dest-namespace=namespace2 dest-server="https://kubernetes.default.svc" diff_ms=854 fields.level=2 git_ms=5581 health_ms=2 live_ms=90 settings_ms=0 sync_ms=3 time_ms=10483
Screenshot 2023-08-12 at 7 46 45 PM copy Screenshot 2023-08-12 at 7 46 15 PM copy
crenshaw-dev commented 10 months ago

It deleted every single resource? I thought there was a failsafe which would block syncs if every resource was marked to be pruned.

rumstead commented 10 months ago

It deleted every single resource? I thought there was a failsafe which would block syncs if every resource was marked to be pruned.

it deleted every resource :-/ deployed by that application and then the finalizers took over downstream.

crenshaw-dev commented 10 months ago

I feel like the only real hint from the logs is all the obj->nil messages.

rumstead commented 10 months ago

Agreed, from my understanding of the code:

  1. I believe we are seeing the obj->nil logs from here based on the String() method.
    • where the live obj isn't nil but the target obj is since we see obj -> nil
  2. here is where i believe we are getting the nil results
    • I have no idea what local manifests are and just assume they are empty because i can't find them really anywhere in the code other than saying its for testing
crenshaw-dev commented 10 months ago

Local manifests support the argocd app sync --local feature, i.e. temporary overrides from a local manifest source. Probably not relevant in this case.

Feels like the problem must be similar to https://github.com/argoproj/argo-cd/issues/2573 - some error being ignored silently.

rumstead commented 10 months ago

The resources deployed are a CRD that we created. I ran them through the code a few times without any issues unmarshaling and after the hard refresh it worked. It didn't like something :-/.

Do you happen to know if there is a code to safeguard against pruning everything?

crenshaw-dev commented 10 months ago

There is https://github.com/argoproj/argo-cd/blob/94b986da2e99e0b0a76a9d3bae5032c04d6c4395/controller/appcontroller.go#L1728

rumstead commented 10 months ago

The CRDs are cluster scoped

❯ k api-resources --api-group=internal.com
NAME      SHORTNAMES   APIVERSION               NAMESPACED   KIND
tenants   tn,tnt       internal.com/v1alpha1   false        Tenant

one thing we also noticed.

In the prune case, the sync tasks look like this (ie, a missing namespace between the objects)

 Sync/0 resource internal.com/Tenant:/app-ui obj->nil

After the hard refresh, they look like this

 Sync/0 resource internal.com/Tenant:namespace/app-ui obj->obj
jannfis commented 10 months ago

Hm. Did you make changes to the CRD recently to move from namespaced to non-namespaced?

Figuring that the prune happened against revision 9b9ff97b11d83373bdd4d05fadcf20c791e5600e, have you had a look at repository server logs? Have there been some errors recorded for rendering this particular revision?

rumstead commented 10 months ago

Hm. Did you make changes to the CRD recently to move from namespaced to non-namespaced?

Figuring that the prune happened against revision 9b9ff97b11d83373bdd4d05fadcf20c791e5600e, have you had a look at repository server logs? Have there been some errors recorded for rendering this particular revision?

No recent changes in the CRD, it has always been cluster scoped but the Argo CD Application that deploys the Tenant resources has a namespace defined in it (though it always has). No errors in the repo server logs against that commit :(.

The only logs I have tracked own in the repo server are the GenerateManfiest grpc call from getRepoObjs and the subsequent git checkout --force.

time="2023-08-12T17:25:47Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:https://...,Path:apps/,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:&ApplicationSourceDirectory{Recurse:true,Jsonnet:ApplicationSourceJsonnet{ExtVars:[]JsonnetVar{},TLAs:[]JsonnetVar{},Libs:[],},Exclude:,Include:,},Plugin:nil,Chart:,}/9b9ff97b11d83373bdd4d05fadcf20c791e5600e"

and then further cache hits

time="2023-08-12T17:25:54Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:https://...,Path:apps/,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:&ApplicationSourceDirectory{Recurse:true,Jsonnet:ApplicationSourceJsonnet{ExtVars:[]JsonnetVar{},TLAs:[]JsonnetVar{},Libs:[],},Exclude:,Include:,},Plugin:nil,Chart:,}/9b9ff97b11d83373bdd4d05fadcf20c791e5600e"

EDIT: since targetObj was nil, it uses liveObj. Since liveObj doesn't have the namespace (its cluster scoped) no namespace is used

EDIT2: Prune does indeed work... (duh!)

Screenshot 2023-08-15 at 3 49 09 PM

rumstead commented 10 months ago

We have tried relentlessly to reproduce this and haven't been able to. I am going to close it and will reopen if we get more details. Thanks for everyone's suggestions. Hopefully, we don't see this again.

joantomas commented 2 months ago

Hi,

sorry to reopen this issue but we have experimented the same behaviour. In the logs we see that application-controller ends a Reconciliation in both pods (we have HA) without changes:

Log PodName
time="2024-04-23T11:13:09Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:09Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:09Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:git@github.com:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:09Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:git@github.com:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:09Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=11 unmarshal_ms=10 version_ms=0 argocd-application-controller-1
time="2024-04-23T11:13:09Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=8 plugins_ms=0 repo_ms=0 time_ms=16 unmarshal_ms=8 version_ms=0 argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Skipping auto-sync: application status is Synced" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Skipping auto-sync: application status is Synced" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:09Z" level=info msg="No status changes. Skipping patch" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Reconciliation completed" application=argocd/sdk-builder dedup_ms=0 dest-name= dest-namespace=sdk-builder dest-server="https://kubernetes.default.svc" diff_ms=35 fields.level=1 git_ms=16 health_ms=2 live_ms=9 settings_ms=0 sync_ms=0 time_ms=92 argocd-application-controller-0
time="2024-04-23T11:13:09Z" level=info msg="Reconciliation completed" application=argocd/sdk-builder dedup_ms=0 dest-name= dest-namespace=sdk-builder dest-server="https://kubernetes.default.svc" diff_ms=43 fields.level=1 git_ms=11 health_ms=2 live_ms=5 settings_ms=0 sync_ms=0 time_ms=91 argocd-application-controller-1
time="2024-04-23T11:13:09Z" level=info msg="No status changes. Skipping patch" application=argocd/sdk-builder argocd-application-controller-1

After just 10 seconds, both application servers starts again the resync process to the same git commit (1e189c27189403969bd31706760e7a43c2fc833a). In that case, the repo server reports a cache miss and starts a new sync which starts to prune resources:

Log PodName
time="2024-04-23T11:13:14Z" level=info msg="Start processing" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:14Z" level=info msg="Processing completed" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:24Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:24Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:24Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:git@github.com:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:24Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:24Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:24Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:git@github.com:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:26Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:git@github.com:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:26Z" level=info msg="manifest cache miss: &ApplicationSource{RepoURL:git@github.com:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:28Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=3635 unmarshal_ms=3635 version_ms=0 argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Start processing" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="Processing completed" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="Initiated automated sync to '1e189c27189403969bd31706760e7a43c2fc833a'" application=sdk-builder dest-namespace=sdk-builder dest-server="https://kubernetes.default.svc" reason=OperationStarted type=Normal argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Initiated automated sync to '1e189c27189403969bd31706760e7a43c2fc833a'" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Updated sync status: Synced -> OutOfSync" application=sdk-builder dest-namespace=sdk-builder dest-server="https://kubernetes.default.svc" reason=ResourceUpdated type=Normal argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Processing completed" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="Start processing" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="Initialized new operation: {&SyncOperation{Revision:1e189c27189403969bd31706760e7a43c2fc833a,Prune:true,DryRun:false,SyncStrategy:nil,Resources:[]SyncOperationResource{SyncOperationResource{Group:,Kind:ServiceAccount,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder-mrfio-https,Namespace:,},SyncOperationResource{Group:apps,Kind:Deployment,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:rbac.authorization.k8s.io,Kind:RoleBinding,Name:sdk-builder-rolebinding,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-cluster,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:VirtualService,Name:sdk-builder-leaderelection,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:VirtualService,Name:sdk-builder-vs,Namespace:,},SyncOperationResource{Group:,Kind:Secret,Name:sdk-builder-env,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdkmrfio-gw,Namespace:,},SyncOperationResource{Group:policy,Kind:PodDisruptionBudget,Name:sdk-builder-pdb,Namespace:,},SyncOperationResource{Group:,Kind:Namespace,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder-cluster-https,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-mrfio,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:ServiceEntry,Name:sdk-mrf-io-cloudfront,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-leaderelection,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:DestinationRule,Name:sdk-mrfio-dest,Namespace:,},SyncOperationResource{Group:monitoring.coreos.com,Kind:ServiceMonitor,Name:appmetrics-sdk-builder,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:DestinationRule,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:rbac.authorization.k8s.io,Kind:Role,Name:sdk-builder-role,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder.leaderelection.mrf.io,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdkmrfio-https,Namespace:,},SyncOperationResource{Group:,Kind:Service,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:autoscaling,Kind:HorizontalPodAutoscaler,Name:sdk-builder-autoscaler,Namespace:,},SyncOperationResource{Group:security.istio.io,Kind:AuthorizationPolicy,Name:sdk-builder,Namespace:,},},Source:nil,Manifests:[],SyncOptions:[ServerSideApply=true],Sources:[]ApplicationSource{},Revisions:[],} { true} [] {5 nil}}" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="updated 'argocd/sdk-builder' operation (phase: Running)" argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:git@github.com:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-6q492
time="2024-04-23T11:13:28Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=7 unmarshal_ms=7 version_ms=0 argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Initialized new operation: {&SyncOperation{Revision:1e189c27189403969bd31706760e7a43c2fc833a,Prune:true,DryRun:false,SyncStrategy:nil,Resources:[]SyncOperationResource{SyncOperationResource{Group:,Kind:ServiceAccount,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder-mrfio-https,Namespace:,},SyncOperationResource{Group:apps,Kind:Deployment,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:rbac.authorization.k8s.io,Kind:RoleBinding,Name:sdk-builder-rolebinding,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-cluster,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:VirtualService,Name:sdk-builder-leaderelection,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:VirtualService,Name:sdk-builder-vs,Namespace:,},SyncOperationResource{Group:,Kind:Secret,Name:sdk-builder-env,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdkmrfio-gw,Namespace:,},SyncOperationResource{Group:policy,Kind:PodDisruptionBudget,Name:sdk-builder-pdb,Namespace:,},SyncOperationResource{Group:,Kind:Namespace,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder-cluster-https,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-mrfio,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:ServiceEntry,Name:sdk-mrf-io-cloudfront,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:Gateway,Name:sdk-builder-leaderelection,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:DestinationRule,Name:sdk-mrfio-dest,Namespace:,},SyncOperationResource{Group:monitoring.coreos.com,Kind:ServiceMonitor,Name:appmetrics-sdk-builder,Namespace:,},SyncOperationResource{Group:networking.istio.io,Kind:DestinationRule,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:rbac.authorization.k8s.io,Kind:Role,Name:sdk-builder-role,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdk-builder.leaderelection.mrf.io,Namespace:,},SyncOperationResource{Group:cert-manager.io,Kind:Certificate,Name:sdkmrfio-https,Namespace:,},SyncOperationResource{Group:,Kind:Service,Name:sdk-builder,Namespace:,},SyncOperationResource{Group:autoscaling,Kind:HorizontalPodAutoscaler,Name:sdk-builder-autoscaler,Namespace:,},SyncOperationResource{Group:security.istio.io,Kind:AuthorizationPolicy,Name:sdk-builder,Namespace:,},},Source:nil,Manifests:[],SyncOptions:[ServerSideApply=true],Sources:[]ApplicationSource{},Revisions:[],} { true} [] {5 nil}}" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="updated 'argocd/sdk-builder' operation (phase: Running)" argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg=Syncing application=argocd/sdk-builder skipHooks=true started=false syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:git@github.com:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-6q492
time="2024-04-23T11:13:28Z" level=info msg="Processing completed" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="Start processing" resource=argocd/sdk-builder argocd-notifications-controller-65cffbdbdd-qh8fk
time="2024-04-23T11:13:28Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=7 unmarshal_ms=7 version_ms=0 argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Reconciliation completed" application=argocd/sdk-builder dedup_ms=0 dest-name= dest-namespace=sdk-builder dest-server="https://kubernetes.default.svc" diff_ms=24 fields.level=1 git_ms=3635 health_ms=2 live_ms=6 settings_ms=0 sync_ms=0 time_ms=3848 argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Update successful" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Tasks (dry-run)" application=argocd/sdk-builder syncId=05403-RNPms tasks="[Sync/0 resource /Namespace:/sdk-builder obj->nil (,,), Sync/0 resource policy/PodDisruptionBudget:sdk-builder/sdk-builder-pdb obj->nil (,,), Sync/0 resource /ServiceAccount:sdk-builder/sdk-builder obj->nil (,,), Sync/0 resource /Secret:sdk-builder/sdk-builder-env obj->nil (,,), Sync/0 resource rbac.authorization.k8s.io/Role:sdk-builder/sdk-builder-role obj->nil (,,), Sync/0 resource rbac.authorization.k8s.io/RoleBinding:sdk-builder/sdk-builder-rolebinding obj->nil (,,), Sync/0 resource /Service:sdk-builder/sdk-builder obj->nil (,,), Sync/0 resource apps/Deployment:sdk-builder/sdk-builder obj->nil (,,), Sync/0 resource autoscaling/HorizontalPodAutoscaler:sdk-builder/sdk-builder-autoscaler obj->nil (,,), Sync/0 resource monitoring.coreos.com/ServiceMonitor:sdk-builder/appmetrics-sdk-builder obj->nil (,,), Sync/0 resource security.istio.io/AuthorizationPolicy:sdk-builder/sdk-builder obj->nil (,,), Sync/0 resource networking.istio.io/DestinationRule:sdk-builder/sdk-builder obj->nil (,,), Sync/0 resource networking.istio.io/Gateway:sdk-builder/sdk-builder-cluster obj->nil (,,), Sync/0 resource cert-manager.io/Certificate:istio-system/sdk-builder-cluster-https obj->nil (,,), Sync/0 resource networking.istio.io/VirtualService:sdk-builder/sdk-builder-leaderelection obj->nil (,,), Sync/0 resource networking.istio.io/Gateway:sdk-builder/sdk-builder-leaderelection obj->nil (,,), Sync/0 resource networking.istio.io/Gateway:sdk-builder/sdk-builder-mrfio obj->nil (,,), Sync/0 resource cert-manager.io/Certificate:istio-system/sdk-builder-mrfio-https obj->nil (,,), Sync/0 resource networking.istio.io/VirtualService:sdk-builder/sdk-builder-vs obj->nil (,,), Sync/0 resource cert-manager.io/Certificate:istio-system/sdk-builder.leaderelection.mrf.io obj->nil (,,), Sync/0 resource networking.istio.io/ServiceEntry:sdk-builder/sdk-mrf-io-cloudfront obj->nil (,,), Sync/0 resource networking.istio.io/DestinationRule:sdk-builder/sdk-mrfio-dest obj->nil (,,), Sync/0 resource networking.istio.io/Gateway:sdk-builder/sdkmrfio-gw obj->nil (,,), Sync/0 resource cert-manager.io/Certificate:istio-system/sdkmrfio-https obj->nil (,,)]" argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Updating operation state. phase: Running -> Running, message: '' -> 'one or more tasks are running'" application=argocd/sdk-builder syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=Role name=sdk-builder-role namespace=sdk-builder phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sdk-builder)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sdk-builder argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=DestinationRule name=sdk-builder namespace=sdk-builder phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=ServiceAccount name=sdk-builder namespace=sdk-builder phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:git@github.com:MyGitAccount/myNiceRepo.git,Path:./,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:DecryptFromSopsForPreviewEnv,Env:[]*EnvEntry{},Parameters:[]ApplicationSourcePluginParameter{},},Chart:,Ref:,}/1e189c27189403969bd31706760e7a43c2fc833a" argocd-repo-server-667cd4fb45-89pb2
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=RoleBinding name=sdk-builder-rolebinding namespace=sdk-builder phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="getRepoObjs stats" application=argocd/sdk-builder build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=8 unmarshal_ms=7 version_ms=0 argocd-application-controller-0
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=Certificate name=sdk-builder-mrfio-https namespace=istio-system phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=Certificate name=sdkmrfio-https namespace=istio-system phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg="Adding resource result, status: 'Pruned', phase: 'Succeeded', message: 'pruned'" application=argocd/sdk-builder kind=Gateway name=sdkmrfio-gw namespace=sdk-builder phase=Sync syncId=05403-RNPms argocd-application-controller-1
time="2024-04-23T11:13:28Z" level=info msg=Syncing application=argocd/sdk-builder skipHooks=true started=false syncId=05399-kEkVM argocd-application-controller-0

After that the application enters an inestable status where argo is not able to prune all the resources but, at the same time, the deployment is pruned so, not able to respond to requests until we force a manual sync with Argo. We have some questions that we are not able to answer:

Finally, we are using a quite old ArgoCD version (2.6.6) and we are planning to update during the next weeks, even though, we haven't been able to find any bug fix that could address this problem. Does it exist? We have attached the full logs if it could help (argocd.csv).

ArgoCD version:

argocd: v2.6.6+6d4de2e
  BuildDate: 2023-03-16T22:25:45Z
  GitCommit: 6d4de2ec5d49fa2c6823f2b7d101607a839be3fa
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: linux/amd64

Many thanks for your help

rumstead commented 2 months ago

Very sorry this happened, it was extremely destructive for us. I feel a lot less crazy now that someone else experienced it.

Similar to what you alluded to, the only semi-explanation I could come up with was redis returning empty cached manifests. Idk if it was caused by the repo server storing empty manifests because it swallowed an error while rendering or if something failed while storing the manifests in redis.

I also cannot explain how the "don't prune everything check" was bypassed.

ericblackburn commented 2 months ago

To add to this, we have seen this bug happen in 2.10.7. For our situation, Argo showed a task dry-run with all cluster scoped resources managed by the Argo App as having a target of nil, but the non-cluster scoped (namespaced) resources rendered just fine. So, Argo pruned the cluster scoped resources like crds, cluster-role bindings, cluster roles.

ojasgo commented 1 month ago

Yep. Seeing this issue as well. In my case, it happened when trying to update an object which requires recreation, where simple update wont work. Ideally, it should have picked particular resource, but it deleted the whole appset.

ncdc commented 1 week ago

FYI, the incorrect deletions of cluster-scoped resources should be fixed by https://github.com/argoproj/gitops-engine/pull/597, assuming it's approved, merged, and pulled into ArgoCD.