jannfis commented 1 year ago

Summary

When a sync fails for some reason, and retry is enabled, the Application should be refreshed in between the sync retries instead of re-using the same sync context for each retry.

Motivation

When auto-sync is enabled, a sync that runs with retries enabled may take a long time to complete if it hits an error that retrying cannot resolve (for example, an erroneous manifest), even if the error has already been fixed at the source. Even if Argo CD receives a refresh while the broken sync is in its retry loop, it won't consider any new changes in the repository, so auto-sync ultimately fails until the next commit or a manual refresh of the application.

Similarly, if self-heal is enabled, the following situation can occur:

An Application manages the Namespace itself and a couple of resources in it
Somebody deletes the Namespace on the cluster
Kubernetes deletes the resources within that Namespace
Argo CD receives an event that a managed resource was deleted and starts a self-heal sync to restore the deleted resource
Meanwhile, Kubernetes also deletes the Namespace
The sync triggered by self-heal fails, because the target namespace doesn't exist
Argo CD enters the sync-retry loop with the same, previous sync context, without considering the deleted Namespace that should be self-healed too
After 5 retries, the sync fails and leaves the cluster in a broken state that needs to be recovered manually, even though auto-sync and self-heal are enabled
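
The sequence above can be reproduced with a small toy model. This is only a sketch: `plan_sync`, `apply`, the `demo` namespace and the `settings` ConfigMap are hypothetical names chosen for illustration, not Argo CD APIs, and the model assumes a plan is simply "the desired resources that are missing from the cluster".

```python
# Toy model only: every name here is hypothetical, none of it is Argo CD code.

def plan_sync(desired, live):
    """Return the desired resources that are currently missing from the cluster."""
    return [resource for resource in desired if resource not in live]


def apply(plan, live):
    """Create the planned resources; namespaced ones need their Namespace to exist."""
    for kind, name in plan:
        if kind != "Namespace" and ("Namespace", "demo") not in live:
            raise RuntimeError(f"namespace 'demo' not found for {kind}/{name}")
        live.add((kind, name))


desired = [("Namespace", "demo"), ("ConfigMap", "settings")]

# The ConfigMap is already gone, but the Namespace deletion has not finished yet,
# so the self-heal plan only contains the ConfigMap.
live = {("Namespace", "demo")}
stale_plan = plan_sync(desired, live)

# Kubernetes now finishes deleting the Namespace.
live.discard(("Namespace", "demo"))

# Every retry re-uses the stale plan, so every retry fails the same way.
failures = 0
for _ in range(5):
    try:
        apply(stale_plan, live)
        break
    except RuntimeError:
        failures += 1
```

In this model all five attempts fail and the cluster stays empty, because the plan was computed before the Namespace disappeared and is never recomputed.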

Proposal

With sync retries enabled, Argo CD should refresh the Application and update its sync context after a sync error, before each subsequent retry. It should:

Pick up any changes to targetRevision made in the source between the time the sync started and the retry, and
Pick up any changes surfaced by self-heal, so that they are included in the next retry
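
A minimal sketch of the proposed behaviour, assuming the sync context can be rebuilt by a `refresh` callable. The function `sync_with_retries`, its parameters and the simulated `repo` are hypothetical and only illustrate the idea; they do not describe how Argo CD implements retries.

```python
# Hypothetical sketch of "refresh between retries"; not Argo CD code.

def sync_with_retries(refresh, apply, max_retries=5):
    """Run one sync attempt plus up to max_retries retries.

    The sync context is rebuilt with refresh() before every retry instead of
    being re-used. Returns the number of retries that were needed.
    """
    context = refresh()
    for attempt in range(max_retries + 1):
        try:
            apply(context)
            return attempt
        except RuntimeError:
            if attempt == max_retries:
                raise
            # Pick up new commits and self-heal changes before retrying.
            context = refresh()


# Simulated source repository whose manifest is fixed while the sync is failing.
repo = {"revision": "bad"}
seen_revisions = []


def refresh():
    """Snapshot the current state of the source."""
    return dict(repo)


def apply(context):
    seen_revisions.append(context["revision"])
    if context["revision"] == "bad":
        repo["revision"] = "fixed"  # a fix lands while the first attempt fails
        raise RuntimeError("invalid manifest")


retries_needed = sync_with_retries(refresh, apply)
```

Here the first attempt fails on the broken revision, and the first retry succeeds because it sees the fixed one. With a re-used context, every retry would have applied the broken revision again.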

oscrx commented 1 year ago

I think this is related to the issue I reported earlier #10303

jannfis commented 1 year ago

@oscrx Yep. It seems to be very related. Thanks for linking.

argoproj / argo-cd

Application should be refreshed in-between sync retries #12904
