Open jlewi opened 1 year ago
My intuition is the same as yours. I feel like limit: -1
should mean "retry indefinitely or until there's a new commit."
Any update here? Seems like pretty critical functionality.
~In my limited testing on v2.8.0 I think the in-progress broken commit's sync does timeout after 10 minutes, but I can't find where this timeout is defined.~ EDIT: Nevermind, apparently this was a fluke; I can't reproduce.
In any case a fixed timeout is unsatisfactory. Upon a new commit being pushed, the auto-sync logic should immediately terminate any in-progress syncs and start a new one. For now the manual workaround is to click the TERMINATE
button in the UI or run argocd app terminate-op <APPLICATION>
with the CLI.
I found that @Sayrus wrote a workaround to address this issue: https://github.com/Sayrus/argo-cd/commit/817bc3449768021d0d5ad7f1ce7510bcd9d2f486 Combined with auto-sync, it makes ArgoCD eventually deploys changes from new commits by forcing a Terminate after Xs of unsuccessful sync. This makes ArgoCD refresh and then sync again. This is not ideal but may unlock lot of folks.
As mentioned in https://github.com/argoproj/argo-cd/issues/15642
my view is that it would be better solved with an algorithm that detects lack of progress combined with a newer commit.
In my case, the lack of progress was obvious, i.e. a rejected apply
due to invalid YAML, which should be caught by the algorithm, but I imagine it can be problematic to terminate early when dealing with finalizers and such.
I found that @Sayrus wrote a workaround to address this issue: Sayrus@817bc34 Combined with auto-sync, it makes ArgoCD eventually deploys changes from new commits by forcing a Terminate after Xs of unsuccessful sync. This makes ArgoCD refresh and then sync again. This is not ideal but may unlock lot of folks.
Hm so it's a non-configurable 24 hour timeout? I guess it's better than nothing, but yeah not ideal.
I've worked around this for now by defining a Github Workflow that triggers on any push to the default branch of my ArgoCD-managed manifest monorepo. The workflow uses the argocd
CLI to look for any Applications that have in-progress syncs targeting a commit other than the latest commit in the repo and runs argocd app terminate-op
on them. Combined with auto-sync and the ApplyOutOfSyncOnly=false
sync option, I believe this makes for a reasonably safe and responsive workaround.
I found that @Sayrus wrote a workaround to address this issue: Sayrus@817bc34 Combined with auto-sync, it makes ArgoCD eventually deploys changes from new commits by forcing a Terminate after Xs of unsuccessful sync. This makes ArgoCD refresh and then sync again. This is not ideal but may unlock lot of folks.
Hm so it's a non-configurable 24 hour timeout? I guess it's better than nothing, but yeah not ideal.
I've worked around this for now by defining a Github Workflow that triggers on any push to the default branch of my ArgoCD-managed manifest monorepo. The workflow uses the
argocd
CLI to look for any Applications that have in-progress syncs targeting a commit other than the latest commit in the repo and runsargocd app terminate-op
on them. Combined with auto-sync and theApplyOutOfSyncOnly=false
sync option, I believe this makes for a reasonably safe and responsive workaround.
We currently have external logic to terminate sync with the CLI too which is why I didn't continue on my PoC. The timeout can easily be made configurable since we have access to the App Spec within that context.
My intuition is the same as yours. I feel like
limit: -1
should mean "retry indefinitely or until there's a new commit."
Yes, I'd agree. Right now, the same parameters are re-used for any follow up sync operation (i.e. target revision, all source parameters, etc), which makes retries useless in some scenarios. We probably need to perform a refresh before the next retry, and update the operation with any new information (new commit SHA, changed source parameters, etc).
I have a fix for this, which I want to discuss in today's contributor's meeting
Glad to hear you have an idea. Did anything come out of the meeting?
On Thu, Sep 28, 2023, 21:14 jannfis @.***> wrote:
I have a fix for this, which I want to discuss in today's contributor's meeting
— Reply to this email directly, view it on GitHub https://github.com/argoproj/argo-cd/issues/11494#issuecomment-1739499714, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAVTC7G67J7QM4PQQFB7SDTX4WH6TANCNFSM6AAAAAASPBCTQU . You are receiving this because you are subscribed to this thread.Message ID: @.***>
Here is the recording: https://youtu.be/baIX9Bk6f5w?t=1173. Alex was absent to the meeting and we'd like his take on this. What I understand is we'd prefer to Terminate a sync when a new revision is found instead of mutating the current retry with the new revision. This way, the whole sync process is redone, history is consistent and hooks are replayed if they change for example. I opened a draft PR that does exactly that in here: https://github.com/argoproj/argo-cd/pull/15603. I'd love it to work when the sync is in the hook phase too, but I'm unable to locate where in the code this is executed. @alexec @crenshaw-dev @jannfis Please have a look to the PR :pray:
@aslafy-z Any update on this issue? We've been encountering this issue recently and it would be nice to have new commits roll out and fix failed syncs 👍
@mpiercy827 I'd love this PR to move forward. I removed the Draft tag and would love some reviews! @alexec @crenshaw-dev @jannfis please have a look! :pray:
Here is sample workaround script for openshift-gitops-operator 1.11 that translates to ArgoCD 2.9.2.
I did not see it anywhere on the internet or in this issue, so leaving it here for someone who is struggling with this problem.
Inject into cronjob with oc/kubectl and you are good to go.
#!/bin/bash
# Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
# skipping applications where revision information is empty.
# Get list of all ArgoCD applications in the '<argocd_namespace>' namespace
applications=$(oc get applications -n <argocd_namespace> -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')
for app in $applications; do
echo "Processing application: $app"
# Get the currently running sync revision
current_revision=$(oc get application "$app" -n <argocd_namespace> -o jsonpath='{.operation.sync.revision}')
# Get the desired sync revision
desired_revision=$(oc get application "$app" -n <argocd_namespace> -o jsonpath='{.status.sync.revision}')
echo "Current revision for $app: $current_revision"
echo "Desired revision for $app: $desired_revision"
# Skip the application if either the current or desired revision is empty
if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
echo "Skipping application $app due to missing revision information."
continue
fi
# Compare the two revisions
if [ "$current_revision" != "$desired_revision" ]; then
echo "Revision mismatch detected for application: $app. Terminating operation."
# Terminate the operation for the application
oc exec <controller_pod_name> -n <argocd_namespace> -- argocd app terminate-op "$app" --core
else
echo "No revision mismatch for application: $app. No action needed."
fi
done
Great job! Any update?
Thanks!
I was surprised this is not the behavior in ArgoCD! Any chance we can get this fixed? how can I help?
I think ArgoCD need two knobs to better handle Syncing.
timeout
feature so that jobs that are stuck indefinitely can eventually fail
. ( https://github.com/argoproj/argo-cd/issues/16489 and https://github.com/argoproj/argo-cd/issues/6055 )I believe this is the most intuitive approach and it's simpler because a Sync attempt is only attempting a single SHA; and you won't terminate syncs while they're running regardless they're stuck or not (a preSync Job can take a while but you don't really want to kill it because a new commit was pushed). And as a user I have control over the timeouts.
@sherifabdlnaby wouldn't 2 automagically occur after 1 happens?
As this is what happen if you click cancel manually.
I guess if we did need two paths it would be something like sync-timeout
and sync-timeout-condition
where sync-timeout-condition
defaults to 'cancel' instead of 'fail' to match current convention when this timeout issue occurs
any progress on this?
To summarize this ticket. There is a PR open https://github.com/argoproj/argo-cd/pull/15603 that is about 1 year old that IUC implements the approach agreed upon at the community meeting. https://github.com/argoproj/argo-cd/issues/11494#issuecomment-1742873954
It looks the PR is stuck on some failing tests from July of this year https://github.com/argoproj/argo-cd/pull/15603#discussion_r1697734733 with a suggestion on how to fix the test.
So it seems like the path forward would be for someone to pick up that PR to get the tests to pass and then hopefully the ArgoCD maintainers will approve it.
I've updated the script by @jbartyze-rh (thx!) to work out with my ArgoCD v2.10.7 and skip apps that are in state "Synced":
#!/usr/bin/env bash
set -euo pipefail
# Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
# skipping applications where revision information is empty.
# Get list of all ArgoCD applications in the 'argocd' namespace
applications=$(kubectl get applications -n argocd -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')
for app in $applications; do
# Get the current status
current_status=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.status}')
if [[ "$current_status" == "Synced" ]]; then
echo "Skipping application $app due to status is synced."
continue
fi
echo "Processing application: $app"
# Get the currently running sync revision
current_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.operation.sync.revisions}')
# Get the desired sync revision
desired_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.revisions}')
echo "Current revision for $app: $current_revision"
echo "Desired revision for $app: $desired_revision"
# Skip the application if either the current or desired revision is empty
if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
echo "Skipping application $app due to missing revision information."
continue
fi
# Compare the two revisions
if [ "$current_revision" != "$desired_revision" ]; then
echo "Revision mismatch detected for application: $app. Terminating operation."
# Terminate the operation for the application
kubectl exec argocd-application-controller-0 -n argocd -- argocd app terminate-op "$app" --core
else
echo "No revision mismatch for application: $app. No action needed."
fi
done
Here's another version to have it as a cronjob. All you need is a container image suitable to execute. In my case I've created one having the following Dockerfile:
FROM alpine:3.20.3
ARG ARGOCDCLI_VERSION=v2.12.5
ARG KUBECTL_VERSION=v1.30.5
# Add necessary tools
RUN apk add -u --no-cache curl openssl bash jq
RUN curl -SL https://github.com/argoproj/argo-cd/releases/download/$ARGOCDCLI_VERSION/argocd-linux-amd64 -o /usr/local/bin/argocd && chmod +x /usr/local/bin/argocd
RUN curl -SL https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl -o /usr/local/bin/kubectl && chmod +x /usr/local/bin/kubectl
# run as non-root
RUN addgroup -g 1000 -S argocd && adduser -u 1000 -S argocd -G argocd && chown -R argocd:argocd /home/argocd && chmod 0770 /home/argocd
USER argocd
WORKDIR /home/argocd
ENTRYPOINT ["/usr/local/bin/argocd"]
And here's the K8s part using the image above. This must be run in argocd namespace to have the serviceaccount present:
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: argocd-terminate-operations
spec:
schedule: "*/3 * * * *" # At every 3rd minute
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
spec:
containers:
- name: terminate
image: myfancyregsitry/argocdcli-image:v0.0.1
command: ["/script/terminate.sh"]
volumeMounts:
- name: script
mountPath: "/script"
restartPolicy: Never
automountServiceAccountToken: true
serviceAccount: argocd-application-controller
serviceAccountName: argocd-application-controller
volumes:
- name: script
configMap:
name: argocd-terminate-operations-script
defaultMode: 0555
---
apiVersion: v1
kind: ConfigMap
metadata:
name: argocd-terminate-operations-script
data:
terminate.sh: |
#!/usr/bin/env bash
set -euo pipefail
# Script to compare current and desired sync revisions of ArgoCD applications and terminate operations where necessary,
# skipping applications where revision information is empty.
# Get list of all ArgoCD applications in the 'argocd' namespace
applications=$(kubectl get applications -n argocd -o json | jq -r '.items[] | select(.status.operationState.phase == "Running") | .metadata.name')
for app in $applications; do
# Get the current status
current_status=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.status}')
if [[ "$current_status" == "Synced" ]]; then
echo "Skipping application $app due to synced status."
continue
fi
echo "Processing application: $app"
# Get the currently running sync revision
current_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.operation.sync.revisions}')
# Get the desired sync revision
desired_revision=$(kubectl get application "$app" -n argocd -o jsonpath='{.status.sync.revisions}')
# Skip the application if either the current or desired revision is empty
if [ -z "$current_revision" ] || [ -z "$desired_revision" ]; then
echo "Skipping application $app due to missing revision information."
continue
fi
# Compare the two revisions
if [ "$current_revision" != "$desired_revision" ]; then
echo "Current revision for $app: $current_revision"
echo "Desired revision for $app: $desired_revision"
echo "Revision mismatch detected for application: $app. Terminating operation."
# Terminate the operation for the application
argocd app terminate-op "$app" --core
else
echo "No revision mismatch for application: $app. No action needed."
fi
done
One improvement I found after using this script for longer time in my current engagement is changing below.
oc/kubectl get application to oc/kubectl get application.argoproj.io
We faced some issues with overlap of the aliases of different CR, so it is good to be explicit here in API choice.
I originally raised this in https://github.com/argoproj/argo-cd/discussions/11276 I have since observed this behavior repeatedly and so I'm raising it as an issue.
Checklist:
argocd version
.Describe the bug
ArgoCD will keep retrying the same commit that fails to sync properly even if there is a newer commit that fixes the sync.
To Reproduce
NAMESPACETHATDOESNOTEXIST
doesn't existExpected behavior
I expect ArgoCD's semantics to be level-based, declarative.
Level based means I'd expect ArgoCD keeps retrying a failed sync until either that sync succeeds or there is a newer commit. For example, suppose I have an application that is creating a custom resource and the CRD hasn't been created yet (and is created by a separate process) or similarly the namespace doesn't exist and is created by separate process. Level based to me implies ArgoCD will keep retrying periodically so that in the event those issues are resolved the application gets created.
Declarative means I expect the desired state of the world to be the latest commit on the branch. Retrying an older commit once there is a newer commit violates that commit. Importantly, my expectation for gitops is that if there is a problem with a configuration I can fix it by pushing a new commit; if broken commits can block newer commits unless terminated then it doesn't seem like I can fix broken configurations just by pushing a new commit.
Do those expectations align with ArgoCD's? If they do, have I somehow misconfigured ArgoCD in order to achieve those semantics?
Screenshots
Version
Here's my ArgoCD application
Here's an example of a YAML file that I included in my ArgoCD sync'd repository. It is not a K8s resource so it fails to be applied.