argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
17.85k stars 5.45k forks source link

Retrying failed sync's block newer commits; how to achieve declarative, level based gitops semantics? #11494

Open jlewi opened 1 year ago

jlewi commented 1 year ago

I originally raised this in https://github.com/argoproj/argo-cd/discussions/11276 I have since observed this behavior repeatedly and so I'm raising it as an issue.

Checklist:

Describe the bug

ArgoCD will keep retrying the same commit that fails to sync properly even if there is a newer commit that fixes the sync.

To Reproduce

  1. Configure ArgoCD for infinite retries (see Application.yaml provided below)
  2. Add a YAML file containing a non-K8s resource to the repository being applied by Argo (see below).
  3. ArgoCD will try and fail to apply the resource
    • In this particular case it fails to apply the resource because the namespace NAMESPACETHATDOESNOTEXIST doesn't exist
    • My application.yaml deliberately sets the namespace to a non-existent namespace because if any of the resources don't explicitly set the namespace I want that to result in an error rather than defaulting to a value set by ArgoCD
  4. Now delete the YAML file containing the invalid resource and create a new commit
  5. ArgoCD keeps trying to apply the previous commit which will never succeed rather than moving onto the new commit.

Expected behavior

I expect ArgoCD's semantics to be level-based, declarative.

Level based means I'd expect ArgoCD keeps retrying a failed sync until either that sync succeeds or there is a newer commit. For example, suppose I have an application that is creating a custom resource and the CRD hasn't been created yet (and is created by a separate process) or similarly the namespace doesn't exist and is created by separate process. Level based to me implies ArgoCD will keep retrying periodically so that in the event those issues are resolved the application gets created.

Declarative means I expect the desired state of the world to be the latest commit on the branch. Retrying an older commit once there is a newer commit violates that commit. Importantly, my expectation for gitops is that if there is a problem with a configuration I can fix it by pushing a new commit; if broken commits can block newer commits unless terminated then it doesn't seem like I can fix broken configurations just by pushing a new commit.

Do those expectations align with ArgoCD's? If they do, have I somehow misconfigured ArgoCD in order to achieve those semantics?

Screenshots

Version

argocd: v2.5.2+148d8da
  BuildDate: 2022-11-07T17:03:01Z
  GitCommit: 148d8da7a996f6c9f4d102fdd8e688c2ff3fd8c7
  GitTreeState: clean
  GoVersion: go1.18.7
  Compiler: gc
  Platform: darwin/arm64

Here's my ArgoCD application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app
  namespace: argocd
spec:
  # TODO(jeremy): What is project?
  project: default
  source:
    repoURL: git@github.com:jlewi/manifests.git
    targetRevision: HEAD
    path: app
    directory:
      # recurse is not true by default
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    # The namespace will only be set for namespace-scoped resources that have not set a value for .metadata.namespace
    # We intentionally set this to a namespace that doesn't exist because if resources
    # don't have .metadata.namespace set that should be an error
    namespace: NAMESPACETHATDOESNTEXIST
  syncPolicy:
    automated: # automated sync by default retries failed attempts 5 times with following delays between attempts ( 5s, 10s, 20s, 40s, 80s ); retry controlled using `retry` field.
      prune: true # Specifies if resources should be pruned during auto-syncing ( false by default ).
      selfHeal: true # Specifies if partial app sync should be executed when resources are changed only in target Kubernetes cluster and no git change detected ( false by default ).
      allowEmpty: true # Allows deleting all application resources during automatic syncing ( false by default ).
    retry:
        limit: -1 # number of failed sync attempt retries; unlimited number of attempts if less than 0
        backoff:
          duration: 5s # the amount to back off. Default unit is seconds, but could also be a duration (e.g. "2m", "1h")
          factor: 2 # a factor to multiply the base duration after each failed retry
          maxDuration: 3m # the maximum amount of time allowed for the backoff strategy

Here's an example of a YAML file that I included in my ArgoCD sync'd repository. It is not a K8s resource so it fails to be applied.

apiVersion: mlp.primer.ai/v1alpha1
kind: ManifestSync
metadata:
  name: app-dev-takeover
  labels: {}
spec:
  sourceRepo:
    org: jlewi
    repo: flock
    branch: dev-takeover
  forkRepo:
    org: jlewi
    repo: manifests
    branch: hydros/dev-takeover
  destRepo:
    org: jlewi
    repo: manifests
    branch: main
  sourcePath: manifests/app
  selector:
    matchLabels:
      app: app
      environment: dev
  destPath: app
status:
  sourceUrl: https://github.com/jlewi/manifests/tree/8a17313
  sourceCommit: 8a17313
crenshaw-dev commented 1 year ago

My intuition is the same as yours. I feel like limit: -1 should mean "retry indefinitely or until there's a new commit."

tybook commented 1 year ago

Any update here? Seems like pretty critical functionality.

~In my limited testing on v2.8.0 I think the in-progress broken commit's sync does timeout after 10 minutes, but I can't find where this timeout is defined.~ EDIT: Nevermind, apparently this was a fluke; I can't reproduce.

In any case a fixed timeout is unsatisfactory. Upon a new commit being pushed, the auto-sync logic should immediately terminate any in-progress syncs and start a new one. For now the manual workaround is to click the TERMINATE button in the UI or run argocd app terminate-op <APPLICATION> with the CLI.

aslafy-z commented 1 year ago

I found that @Sayrus wrote a workaround to address this issue: https://github.com/Sayrus/argo-cd/commit/817bc3449768021d0d5ad7f1ce7510bcd9d2f486 Combined with auto-sync, it makes ArgoCD eventually deploys changes from new commits by forcing a Terminate after Xs of unsuccessful sync. This makes ArgoCD refresh and then sync again. This is not ideal but may unlock lot of folks.

erikschul commented 1 year ago

As mentioned in https://github.com/argoproj/argo-cd/issues/15642 my view is that it would be better solved with an algorithm that detects lack of progress combined with a newer commit. In my case, the lack of progress was obvious, i.e. a rejected apply due to invalid YAML, which should be caught by the algorithm, but I imagine it can be problematic to terminate early when dealing with finalizers and such.

tybook commented 1 year ago

I found that @Sayrus wrote a workaround to address this issue: Sayrus@817bc34 Combined with auto-sync, it makes ArgoCD eventually deploys changes from new commits by forcing a Terminate after Xs of unsuccessful sync. This makes ArgoCD refresh and then sync again. This is not ideal but may unlock lot of folks.

Hm so it's a non-configurable 24 hour timeout? I guess it's better than nothing, but yeah not ideal.

I've worked around this for now by defining a Github Workflow that triggers on any push to the default branch of my ArgoCD-managed manifest monorepo. The workflow uses the argocd CLI to look for any Applications that have in-progress syncs targeting a commit other than the latest commit in the repo and runs argocd app terminate-op on them. Combined with auto-sync and the ApplyOutOfSyncOnly=false sync option, I believe this makes for a reasonably safe and responsive workaround.

Sayrus commented 1 year ago

I found that @Sayrus wrote a workaround to address this issue: Sayrus@817bc34 Combined with auto-sync, it makes ArgoCD eventually deploys changes from new commits by forcing a Terminate after Xs of unsuccessful sync. This makes ArgoCD refresh and then sync again. This is not ideal but may unlock lot of folks.

Hm so it's a non-configurable 24 hour timeout? I guess it's better than nothing, but yeah not ideal.

I've worked around this for now by defining a Github Workflow that triggers on any push to the default branch of my ArgoCD-managed manifest monorepo. The workflow uses the argocd CLI to look for any Applications that have in-progress syncs targeting a commit other than the latest commit in the repo and runs argocd app terminate-op on them. Combined with auto-sync and the ApplyOutOfSyncOnly=false sync option, I believe this makes for a reasonably safe and responsive workaround.

We currently have external logic to terminate sync with the CLI too which is why I didn't continue on my PoC. The timeout can easily be made configurable since we have access to the App Spec within that context.

jannfis commented 1 year ago

Related: https://github.com/argoproj/argo-cd/issues/15624

jannfis commented 1 year ago

My intuition is the same as yours. I feel like limit: -1 should mean "retry indefinitely or until there's a new commit."

Yes, I'd agree. Right now, the same parameters are re-used for any follow up sync operation (i.e. target revision, all source parameters, etc), which makes retries useless in some scenarios. We probably need to perform a refresh before the next retry, and update the operation with any new information (new commit SHA, changed source parameters, etc).

jannfis commented 1 year ago

I have a fix for this, which I want to discuss in today's contributor's meeting

tybook commented 1 year ago

Glad to hear you have an idea. Did anything come out of the meeting?

On Thu, Sep 28, 2023, 21:14 jannfis @.***> wrote:

I have a fix for this, which I want to discuss in today's contributor's meeting

— Reply to this email directly, view it on GitHub https://github.com/argoproj/argo-cd/issues/11494#issuecomment-1739499714, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAVTC7G67J7QM4PQQFB7SDTX4WH6TANCNFSM6AAAAAASPBCTQU . You are receiving this because you are subscribed to this thread.Message ID: @.***>

aslafy-z commented 1 year ago

Here is the recording: https://youtu.be/baIX9Bk6f5w?t=1173. Alex was absent to the meeting and we'd like his take on this. What I understand is we'd prefer to Terminate a sync when a new revision is found instead of mutating the current retry with the new revision. This way, the whole sync process is redone, history is consistent and hooks are replayed if they change for example. I opened a draft PR that does exactly that in here: https://github.com/argoproj/argo-cd/pull/15603. I'd love it to work when the sync is in the hook phase too, but I'm unable to locate where in the code this is executed. @alexec @crenshaw-dev @jannfis Please have a look to the PR :pray:

mpiercy827 commented 12 months ago

@aslafy-z Any update on this issue? We've been encountering this issue recently and it would be nice to have new commits roll out and fix failed syncs 👍

aslafy-z commented 11 months ago

@mpiercy827 I'd love this PR to move forward. I removed the Draft tag and would love some reviews! @alexec @crenshaw-dev @jannfis please have a look! :pray:

jbartyze-rh commented 9 months ago

Here is sample workaround script for openshift-gitops-operator 1.11 that translates to ArgoCD 2.9.2.

I did not see it anywhere on the internet or in this issue, so leaving it here for someone who is struggling with this problem.

Inject into cronjob with oc/kubectl and you are good to go.

#!/bin/bash

# Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
# skipping applications where revision information is empty.

# Get list of all ArgoCD applications in the '<argocd_namespace>' namespace
applications=$(oc get applications -n <argocd_namespace> -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')

for app in $applications; do
  echo "Processing application: $app"

  # Get the currently running sync revision
  current_revision=$(oc get application "$app" -n <argocd_namespace> -o jsonpath='{.operation.sync.revision}')

  # Get the desired sync revision
  desired_revision=$(oc get application "$app" -n <argocd_namespace> -o jsonpath='{.status.sync.revision}')

  echo "Current revision for $app: $current_revision"
  echo "Desired revision for $app: $desired_revision"

  # Skip the application if either the current or desired revision is empty
  if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
    echo "Skipping application $app due to missing revision information."
    continue
  fi

  # Compare the two revisions
  if [ "$current_revision" != "$desired_revision" ]; then
    echo "Revision mismatch detected for application: $app. Terminating operation."
    # Terminate the operation for the application
     oc exec <controller_pod_name> -n <argocd_namespace> -- argocd app terminate-op "$app" --core
  else
    echo "No revision mismatch for application: $app. No action needed."
  fi
done
javiervelamindcurv commented 9 months ago

Great job! Any update?

Thanks!

sherifabdlnaby commented 7 months ago

I was surprised this is not the behavior in ArgoCD! Any chance we can get this fixed? how can I help?

sherifabdlnaby commented 7 months ago

I think ArgoCD need two knobs to better handle Syncing.

  1. A Sync timeout feature so that jobs that are stuck indefinitely can eventually fail. ( https://github.com/argoproj/argo-cd/issues/16489 and https://github.com/argoproj/argo-cd/issues/6055 )
  2. A Flag to retry the newest version after a Sync has failed (most probably due to timeout) instead of trying the same version regardless of the retry count.

I believe this is the most intuitive approach and it's simpler because a Sync attempt is only attempting a single SHA; and you won't terminate syncs while they're running regardless they're stuck or not (a preSync Job can take a while but you don't really want to kill it because a new commit was pushed). And as a user I have control over the timeouts.

phyzical commented 1 month ago

@sherifabdlnaby wouldn't 2 automagically occur after 1 happens?

As this is what happen if you click cancel manually.

I guess if we did need two paths it would be something like sync-timeout and sync-timeout-condition

where sync-timeout-condition defaults to 'cancel' instead of 'fail' to match current convention when this timeout issue occurs

blixem777 commented 1 month ago

any progress on this?

jlewi commented 1 month ago

To summarize this ticket. There is a PR open https://github.com/argoproj/argo-cd/pull/15603 that is about 1 year old that IUC implements the approach agreed upon at the community meeting. https://github.com/argoproj/argo-cd/issues/11494#issuecomment-1742873954

It looks the PR is stuck on some failing tests from July of this year https://github.com/argoproj/argo-cd/pull/15603#discussion_r1697734733 with a suggestion on how to fix the test.

So it seems like the path forward would be for someone to pick up that PR to get the tests to pass and then hopefully the ArgoCD maintainers will approve it.

jastBytes commented 3 weeks ago

I've updated the script by @jbartyze-rh (thx!) to work out with my ArgoCD v2.10.7 and skip apps that are in state "Synced":

#!/usr/bin/env bash

set -euo pipefail

# Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
# skipping applications where revision information is empty.

# Get list of all ArgoCD applications in the 'argocd' namespace
applications=$(kubectl get applications -n argocd -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')

for app in $applications; do
  # Get the current status
  current_status=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.status}')

  if [[ "$current_status" == "Synced" ]]; then
    echo "Skipping application $app due to status is synced."
    continue
  fi

  echo "Processing application: $app"

  # Get the currently running sync revision
  current_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.operation.sync.revisions}')

  # Get the desired sync revision
  desired_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.revisions}')

  echo "Current revision for $app: $current_revision"
  echo "Desired revision for $app: $desired_revision"

  # Skip the application if either the current or desired revision is empty
  if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
    echo "Skipping application $app due to missing revision information."
    continue
  fi

  # Compare the two revisions
  if [ "$current_revision" != "$desired_revision" ]; then
    echo "Revision mismatch detected for application: $app. Terminating operation."
    # Terminate the operation for the application
    kubectl exec argocd-application-controller-0 -n argocd -- argocd app terminate-op "$app" --core
  else
    echo "No revision mismatch for application: $app. No action needed."
  fi
done
jastBytes commented 3 weeks ago

Here's another version to have it as a cronjob. All you need is a container image suitable to execute. In my case I've created one having the following Dockerfile:

FROM alpine:3.20.3

ARG ARGOCDCLI_VERSION=v2.12.5
ARG KUBECTL_VERSION=v1.30.5

# Add necessary tools
RUN apk add -u --no-cache curl openssl bash jq

RUN curl -SL https://github.com/argoproj/argo-cd/releases/download/$ARGOCDCLI_VERSION/argocd-linux-amd64 -o /usr/local/bin/argocd && chmod +x /usr/local/bin/argocd
RUN curl -SL https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -o /usr/local/bin/kubectl && chmod +x /usr/local/bin/kubectl

# run as non-root
RUN addgroup -g 1000 -S argocd && adduser -u 1000 -S argocd -G argocd && chown -R argocd:argocd /home/argocd && chmod 0770 /home/argocd
USER argocd

WORKDIR /home/argocd
ENTRYPOINT ["/usr/local/bin/argocd"]

And here's the K8s part using the image above. This must be run in argocd namespace to have the serviceaccount present:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: argocd-terminate-operations
spec:
  schedule: "*/3 * * * *" # At every 3rd minute
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: terminate
            image: myfancyregsitry/argocdcli-image:v0.0.1
            command: ["/script/terminate.sh"]
            volumeMounts:
              - name: script
                mountPath: "/script"
          restartPolicy: Never
          automountServiceAccountToken: true
          serviceAccount: argocd-application-controller
          serviceAccountName: argocd-application-controller
          volumes:
            - name: script
              configMap:
                name: argocd-terminate-operations-script
                defaultMode: 0555
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-terminate-operations-script
data:
  terminate.sh: |
    #!/usr/bin/env bash

    set -euo pipefail

    # Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
    # skipping applications where revision information is empty.

    # Get list of all ArgoCD applications in the 'argocd' namespace
    applications=$(kubectl get applications -n argocd -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')

    for app in $applications; do
      # Get the current status
      current_status=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.status}')

      if [[ "$current_status" == "Synced" ]]; then
        echo "Skipping application $app due to synced status."
        continue
      fi

      echo "Processing application: $app"

      # Get the currently running sync revision
      current_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.operation.sync.revisions}')

      # Get the desired sync revision
      desired_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.revisions}')

      # Skip the application if either the current or desired revision is empty
      if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
        echo "Skipping application $app due to missing revision information."
        continue
      fi

      # Compare the two revisions
      if [ "$current_revision" != "$desired_revision" ]; then
        echo "Current revision for $app: $current_revision"
        echo "Desired revision for $app: $desired_revision"
        echo "Revision mismatch detected for application: $app. Terminating operation."
        # Terminate the operation for the application
        argocd app terminate-op "$app" --core
      else
        echo "No revision mismatch for application: $app. No action needed."
      fi
    done
jbartyze-rh commented 3 weeks ago

One improvement I found after using this script for longer time in my current engagement is changing below.

oc/kubectl get application to oc/kubectl get application.argoproj.io

We faced some issues with overlap of the aliases of different CR, so it is good to be explicit here in API choice.