argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
17.59k stars 5.36k forks source link

ApplicationSet rolling sync stuck in ArgoCD 2.12 #19535

Open carlosrejano opened 1 month ago

carlosrejano commented 1 month ago

Describe the bug

We have recently upgraded to ArgoCD 2.12. We are using ApplicationSets with RollingSync strategy to deploy our applications. Since the upgrade we have found that the rolling sync gets stuck after performing the first step. This seems to be because the targetVersions in the applicationset status do not get updated with the latest one.

We've found the following issue.

The applicationset app status stays in Pending state even when the app has been synced and is healthy:

- application: step-1-application
  lastTransitionTime: "2024-08-13T15:46:32Z"
  message: Application moved to Pending status, watching for the Application resource to start Progressing.
  status: Pending
  step: "1"
  targetRevisions:
    - d6c3c26af50671c9ec1c0fde89c0100ef7398c09

This is the application status operation state:

finishedAt: "2024-08-14T06:31:12Z"
message: successfully synced (all tasks run)
operation:
  info:
    - name: Reason
      value: ApplicationSet RollingSync triggered a sync of this Application resource.
  initiatedBy:
    automated: true
    username: applicationset-controller
  retry: {}
  sync:
    prune: true
    revision: f3929224827b8c9f9be72d70fbd3e06ba9c8baa8
    syncStrategy:
      hook: {}
phase: Succeeded
startedAt: "2024-08-14T06:31:09Z"
syncResult:
  resources:
    - group: argoproj.io
      hookPhase: Succeeded
      kind: AppProject
      message: appproject.argoproj.io/step-applications unchanged
      name: step-applications
      namespace: ppp-system
      status: Synced
      syncPhase: Sync
      version: v1alpha1
    - group: argoproj.io
      hookPhase: Running
      kind: Application
      message: application.argoproj.io/step-application configured
      name: step-application
      namespace: ppp-system
      status: Synced
      syncPhase: Sync
      version: v1alpha1
  revision: 2fadb0cf3bd1ff6848a096f08ab7d1b3d87c694b
  source:
    helm:
      valuesObject:
        targetRevision: 2fadb0cf3bd1ff6848a096f08ab7d1b3d87c694b
    path: ------
    repoURL: -------
    targetRevision: 2fadb0cf3bd1ff6848a096f08ab7d1b3d87c694b

The latest commit is the one in the application status. The one in the applicationset status is older. I think that's the reason why it never continues because the revisions do not match.

One of the errors in the logs I see is the following, it's happening from time to time:

unable to set application set status: Operation cannot be fulfilled on applicationsets.argoproj.io \"step-applications\": the object has been modified; please apply your changes to the latest version and try again

I believe there is a problem there when trying to update the applicationset status.

Sharing a bit more of information of our use case, the applicationset generates the apps from a github repository which has multiple merges a day, this makes that normally there is a merge and a new generation of apps in the middle of a rolling sync. I think there could be an issue there too.

Expected behavior

The rolling sync should continue properly after the first step.

gmauleon commented 3 weeks ago

I can confirm this seems to happen also in 1.11.7 and probably older versions, we sometimes have applications that stay OutOfSync forever and would not sync with the RollingSync option activated in our applicationset.

Today, we had such an occurence and there are a bunch of applications in "waves" preceding the one for the stuck application that have their status to Pending although they have been healthy/synced for a while.

I forced the status to Progressing for those and everything went back to normal and my stuck OutOfSync application synced automatically.

gmauleon commented 2 weeks ago

https://github.com/argoproj/argo-cd/issues/12202

gmauleon commented 2 weeks ago

I believe you might be right on other potential problems, with multiple commits.

Between the time we store the revisions here and the time we compare it here there could have been other commits/revisions changes when the sync is applied (particularly in the case of a mono repo as source for multi apps for the appset), so it will stay stuck in pending again.

I was wondering if @wparr-circle or @crenshaw-dev can confirm and if they have a take on that? Like would there be way to just detect it's a later sync and progress the status anyway? Maybe this needs a similar feature as the metadata.generation on the application sync status itself to detect if the "sync generation" is higher than the one we stored?

gmauleon commented 2 weeks ago

I might be completely off, but what if we overthinked all this? Applicationset RollingSync is ultimately meant to trigger sync on OutofSync application in a certain order right? And this behavior can still be bypassed by manual operations at any point (I can sync any app I want manually).

So if an app is synced and healthy whatever the status of the app in the applicationset, its job is done isn't it? So should we care at all to check revisions or times, couldn't we just check the application syncstatus and healthstatus?

Here is an over simpified graph, we already have some of those transitions (waiting and progressing)

image

tjamet commented 1 week ago

Indeed, it would be interesting to understand the strategy(ies) that should be followed for the progressive sync, in particular when the ApplicationSet spec is updated or that the targetRevision resolves to a new sha during a rolling sync.

Personally, I mostly foresee 2 strategies

  1. Reset the rolling sync and restart from the first Application This means not all revisions will make it to all applications
  2. Enforce that all applications receives progressively all revisions This means there needs to be a mechanism to maintain a queue of revisions to be applied Main use-case: upgrades. Say commit A bumps a controller that adds a CRD and run a migration script, commit B drops support for the legacy behaviour. You want to ensure that commit A is successfully applied to all clusters before commit B is applied.

With those strategies in mind, what I don't get clear is the signification of ApplicationSet.status.ApplicationStatus[].TargetRevisions. Does it aim at reflecting:

My 2 cents is that to achieve the first strategy, we need to ensure that all applications are effectively healthy with the latest revision. If they are not healthy, we need to go, in order, through all the applications and sync them.

This means a high level algorithm like this:

latestTargetRevisions = getCurrentTargetRevision(appSet)

appSet.status = getAppSetApplicationStatuses(appSet)

for appStatus in appSet.status :
   if appStatus.targetRevisions != latestTargetRevisions:
      syncApp(appStatus.Name)
      break // we only update one app at a time
  if not appStatus.Ready || not appStatus.Healthy:
      break // We hold until the app is ready and healthy

This makes me wonder why the appStatus targetRevision is not updated when the app status is Waiting or Pending. Shouldn't it always be up to date?

Strategy 2 is quite more complex as it requires to store a queue of updates to be performed (and wonder how to handle possible failing deployments (update worked for the first applications, but failed in the middle)

gmauleon commented 1 week ago

Well my point is that the revisions themselves unless I'm mistaken are not important at the applicationset level.

I believe we start using those in 2.12 because we were stuck in pending states in 2.11 sometimes, before we were using sync date/time comparaison, but the transition logic suffer the same problem in both, if someone do a manual sync during a rolling sync on one of the app or if other commit comes in (mono repo setup is highly subject to that), then the condition to get out of the pending state is never met.

To my understanding, the application controller is already handling all the follow-up there and will flag an application as outOfSync if it needs a sync. So the applicationset controller shouldn't care which revision or date, it should only care to trigger a sync on outofsync applications in a certain order.

i.e. I think our only problem is the state machine transitions and conditions.

And we already have a "queue" system, the applicationset application statuses is basically a backup of the queue between reconciliation. It goes like that:

applicationset-queue

One block is basically one reconcile of the applicationset. Pending is the state where a sync is triggered by adding an operation on the application, the application controller is responsible to execute the sync itself. So if at any stage the applicationset is updated or an application health or sync state change, it is reevaluated in each reconciliation so that part should already works.

I believe it's really only the transition conditions that should simply not take into account revision or time but just application outofsync, health and operation statuses.

I started a branch to test those simplifications but couldn't advance much in my calendar 😥

And also...it's highly possible that I completely misunderstood the logic and needs 😁

tjamet commented 1 week ago

In my experience, progressive syncs goal is to reduce the risk of synchronising an application (i.e. propagating an update).

In this sense, I see that revisions matters (the immutable ones like commit sha or helm versions), to understand and control what is being applied/deployed and where.

gmauleon commented 1 week ago

Hum I don't think that's the role of argoCD, in your example:

Say commit A bumps a controller that adds a CRD and run a migration script, commit B drops support for the legacy behaviour. You want to ensure that commit A is successfully applied to all clusters before commit B is applied.

That sounds like you need something above argoCD like kargo.

The progressive sync in the applicationset only add logic on how to deploy applications relative to other applications, not between 2 commits for the same application.

tjamet commented 4 days ago

In case of using git generators to deploy several applications, I would agree.

This said, when using cluster generators to deploy the same application into multiple clusters, I believe ApplicationSet converts themselves de-facto into an application deployment orchestrator. Means it needs to have some mechanisms to control the progression of a release, similarly to what Deployments, StatefulSets or DaemonSet do. This need is, IMO, increased with the use of indirect targetRevisions like HEAD, main or my-branch which may point to different commit sha depending on the time at which the sync action was requested