akuity / kargo

Application lifecycle orchestration
https://kargo.akuity.io/
Apache License 2.0

Infinite loop when promoting image #2968

Open stephaneetje opened 2 days ago

stephaneetje commented 2 days ago

Hello,

I've been trying to use Kargo to promote only image tags (using helm-update-image), but on all my stages I get an infinite loop at the argocd-update step.

Here is my stage:

      steps:
        - config:
            checkout:
              - branch: main
                path: ./src
            repoURL: https://github.com/test/test-helm.git
          uses: git-clone
        - as: update-image
          config:
            images:
              - image: europe-docker.pkg.dev/test-001/test/scim
                key: image.tag
                value: Tag
            path: ./src/helm/scim/values-scim-staging-image.yaml
          uses: helm-update-image
        - as: commit
          config:
            messageFromSteps:
              - update-image
            path: ./src
          uses: git-commit
        - config:
            path: ./src
          uses: git-push
        - config:
            apps:
              - name: scim-staging
                sources:
                  - desiredCommitFromStep: commit
                    repoURL: https://github.com/test/test-helm.git
          uses: argocd-update

I also tried removing desiredCommitFromStep, but it didn't change anything. My warehouse subscribes only to an image repository:

apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "3"
    kargo.akuity.io/color: black
  name: scim-prod
  namespace: scim
spec:
  subscriptions:
  - image:
      imageSelectionStrategy: SemVer
      repoURL: europe-docker.pkg.dev/test-01/test/scim
      discoveryLimit: 10
      allowTags: ^v(\d+\.)?(\d+\.)?(\*|\d+)$
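
As an aside, the allowTags pattern above is looser than it may appear: the escaped `\*` makes a literal star a valid final segment. A quick check of the pattern exactly as written (illustrative Python, not part of Kargo):

```python
import re

# The allowTags pattern exactly as written in the Warehouse spec above.
allow_tags = re.compile(r"^v(\d+\.)?(\d+\.)?(\*|\d+)$")

# SemVer-style tags match, as intended.
assert allow_tags.match("v1.2.3")
assert allow_tags.match("v7")

# Tags without the leading "v" are rejected.
assert not allow_tags.match("1.2.3")

# But the literal "*" alternative (likely a leftover from a glob pattern)
# means a tag literally named "v1.2.*" would also be allowed.
assert allow_tags.match("v1.2.*")
```

If only full SemVer tags are intended, something tighter like `^v\d+\.\d+\.\d+$` may be closer to the intent.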

You can see in the Argo CD app's events that it is looping:

[Screenshot: Argo CD app events — 2024-11-20 12:38:54]

Discussion on this began in an earlier thread; I'm opening a separate issue as asked by @krancour.

krancour commented 2 days ago

@stephaneetje my apologies, as I seem to have misunderstood the nature of your loop. With the previous discussion having taken place in a thread about desiredCommit, I was misled into believing it was an actual factor in your problem. If you leave that field alone, as you are currently, AND if there is no commit in your Freight, then when syncing an App, Kargo has no notion of what specific revision that App should observably be synced to once complete.

This means the cause of your loop has nothing to do with a mismatch of the desired revision.

The continuous syncing is probably due to the argocd-update step syncing, then, when checking whether the sync completed, discovering that the most recently completed sync wasn't the one it asked for. It's waiting for its request to complete and seeing a sync completed, but that operation must have been initiated by someone or something else, so it tries again.

Do you happen to have auto-sync enabled on your Apps? It is possible that is what the update step is fighting a losing battle against. We generally suggest disabling auto-sync on Apps when you use Kargo. I will address it in the docs to make that clearer.
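
For anyone landing here: disabling auto-sync means removing (or commenting out) the `spec.syncPolicy.automated` block on the Application. These are standard Argo CD fields; the app name is just the one from this thread:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: scim-staging
spec:
  # ...
  # Ensure spec.syncPolicy has no `automated` block, so that only Kargo
  # (or a human) initiates sync operations:
  syncPolicy: {}
```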

If this ends up not being the issue, enabling debug logging on the controller should provide more insight into what conditions are causing the update step to be caught in this sync loop.

stephaneetje commented 2 days ago

I did deactivate autosync. I already enabled DEBUG logs, but it didn't help:

time="2024-11-20T11:44:58Z" level=debug msg="enqueued Promotion for reconciliation" app=scim-staging namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:44:58Z" level=debug msg="reconciling Promotion" namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:44:58Z" level=debug msg="continuing Promotion" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:44:58Z" level=debug msg="executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:44:58Z" level=debug msg="patched Argo CD Application" app=scim-staging freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:44:58Z" level=debug msg="done executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:44:58Z" level=debug msg=promotion freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim phase="\"Running\"" promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging targetFreight=176d90ae395fe87b5d0117b96ce24cf135601a11
time="2024-11-20T11:45:35Z" level=debug msg="enqueued Promotion for reconciliation" app=scim-staging namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:45:35Z" level=debug msg="reconciling Promotion" namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:45:35Z" level=debug msg="continuing Promotion" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:35Z" level=debug msg="executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:35Z" level=debug msg="patched Argo CD Application" app=scim-staging freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:35Z" level=debug msg="done executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:35Z" level=debug msg=promotion freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim phase="\"Running\"" promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging targetFreight=176d90ae395fe87b5d0117b96ce24cf135601a11
time="2024-11-20T11:45:50Z" level=debug msg="enqueued Promotion for reconciliation" app=scim-staging namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:45:50Z" level=debug msg="reconciling Promotion" namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a
time="2024-11-20T11:45:50Z" level=debug msg="continuing Promotion" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:50Z" level=debug msg="executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:50Z" level=debug msg="patched Argo CD Application" app=scim-staging freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:50Z" level=debug msg="done executing argocd-update promotion step" freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging
time="2024-11-20T11:45:50Z" level=debug msg=promotion freight=176d90ae395fe87b5d0117b96ce24cf135601a11 namespace=scim phase="\"Running\"" promotion=staging.01jd4nrbdn19tdabv0t3the2e6.176d90a stage=staging targetFreight=176d90ae395fe87b5d0117b96ce24cf135601a11

EDIT: I verified and the sync is coming from Kargo (INITIATED BY: automated sync policy)

[Screenshot: 2024-11-20 14:30:50]
krancour commented 1 day ago

I verified and the sync is coming from Kargo

Oh. I had no doubt of that.

The mystery is why Kargo is syncing over and over again -- which would usually be because:

  1. It initiates a sync operation
  2. It follows up on the status of the last operation
  3. Finds something unsatisfactory about the status of the last operation
  4. Tries again

I'm trying to piece together what's going wrong at step 3; it was conjecture that something besides Kargo might (additionally) be initiating sync operations.

I am attempting to replicate this.
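
For clarity, steps 1–4 above can be modeled as a single reconciliation pass. This is purely illustrative Python; every name here is hypothetical, not Kargo's actual code:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Operation:
    """The most recent sync operation recorded on an App (hypothetical model)."""
    id: str
    finished: bool
    revisions: List[str]

@dataclass
class App:
    name: str
    pending_sync_id: Optional[str] = None

def reconcile(app: App, desired: List[str],
              last_op: Optional[Operation], initiate_sync) -> str:
    """One pass of the loop described above (conceptual, not Kargo's code)."""
    if last_op is None or last_op.id != app.pending_sync_id:
        app.pending_sync_id = initiate_sync(app, desired)   # 1. initiate a sync
        return "Running"
    if not last_op.finished:
        return "Running"                                    # 2. follow up later
    if last_op.revisions != desired:                        # 3. unsatisfactory
        app.pending_sync_id = initiate_sync(app, desired)   # 4. try again
        return "Running"
    return "Succeeded"
```

If step 3 keeps finding a mismatch, every pass ends in a retry, which is exactly an infinite sync loop.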

hiddeco commented 1 day ago

Running a build with #2936 in it should make it much easier to trace the reason without having to try to reproduce it.

krancour commented 1 day ago

Also, I have tried quite hard to replicate this.

I have 13 Projects x 3 Stages per project, each with an associated Argo CD App.

All 39 Stages are reading from and writing to just one single branch.

Here's a gist of what one Project looks like:

https://gist.github.com/krancour/dafa7ded1b4db4c0683c05e7f01129e8

There are 12 more just like it.

I've been unable to trigger any kind of infinite sync loop.

I'll note the problem I did run into:

With lots and lots of Promos running concurrently, all racing to push to the same remote branch, unsurprisingly, some of those failed because in the space between a Promotion checking out the branch and pushing to it, some other Promotion pushed its own commit. Re-trying any Promotion that failed for these reasons put things back on track. This will be a problem for anyone who has many Stages making updates to a single branch and a large volume of Promotions. (Add this to the list of problems that are solved for free by embracing the notion of branches as env-specific config storage.)

Not to get too sidetracked here, but there's a limited amount we can do about the above because Promotion processes are user-defined and the actions carried out by each step are opaque to the rest of Kargo. This means there's no possibility of "try again starting from step n," but I will open a separate issue with a proposal for a "FailurePolicy" that could have, among other options, the option to start over from step 0.

Circling back to the subject at hand: as I said, I couldn't reproduce the issue, so it would be interesting, @stephaneetje, if you can identify what you and I have done differently. I attempted to replicate your configuration as best I could, but there's clearly some consequential difference.

hiddeco commented 1 day ago

Running a build with #2936 in it should make it much easier to trace the reason without having to try to reproduce it.

ghcr.io/hiddeco/kargo:1.0.4-rc.66550612 built from our release-1.0 branch should log the reason for the infinite loop, if any.

hghtwr commented 1 day ago

I have the same problem and this is what I found in the logs:

time="2024-11-20T18:21:15Z" level=debug msg="sync result revisions [0b8559a96d3441da09b2c84f6bd85f8fa06e56d1] do not match desired revisions [62429f4247599ff1746023c29e4514e24dee322e]" freight=d82c676db847dce25b54c3328801ec10fae9592b namespace=kargo-poc-tags promotion=teststable-tags.01jd5cw76w9waj367f69rsk1ek.d82c676 stage=teststable-tags

Here 0b8559 is the head of the branch (with new manifests in a separate folder) and 62429f is the commit the Freight originated from.

Could it by any chance be related to merging the results back into the same source branch?

stephaneetje commented 1 day ago

@krancour I just tried with only the name:

      - uses: argocd-update
        config:
          apps:
            - name: scim-staging

The step passes, but I end up with the application being synced successfully, then going OutOfSync (the image has not been updated). I checked the revision used by the sync; it is the right one, with the image change.

[Screenshots: 2024-11-20 20:24:42 and 2024-11-20 20:24:57]

I have to do another manual sync.

Also, the Kargo documentation says that apps[].sources[].repoURL is mandatory; that's why I had it before, and because I thought it would make sure the synced commit is the one from the preceding step.

krancour commented 1 day ago

The step passes, but I end up with the application being synced successfully, then going OutOfSync (the image has not been updated). I checked the revision used by the sync; it is the right one, with the image change.

This has the appearance of an Argo CD quirk/issue and not a Kargo issue.

Also, the Kargo documentation says that apps[].sources[].repoURL is mandatory; that's why I had it before, and because I thought it would make sure the synced commit is the one from the preceding step.

Ahhhhh. This explains everything. 😄 I had not noticed that source update.

sources isn't mandatory. repoURL is mandatory on each source, if you specify any, that is. That is for updating individual sources of the Application -- which is something you have no need to do. All you want to do is trigger a sync.

When you do perform any kind of update on a source, there's a whole additional layer of checking that happens when verifying the sync was successful: it compares what the sources look like to what you want them to look like. There was a discrepancy here (somehow?), and because of it the sync was re-attempted over and over, since it kept looking like it hadn't worked.

stephaneetje commented 1 day ago

This has the appearance of an Argo CD quirk/issue and not a Kargo issue.

I opened an Argo CD issue, but I feel like Kargo is syncing before refreshing when I look at the application events.

[Screenshot: application events — 2024-11-21 09:19:14]
andrii-korotkov-verkada commented 16 hours ago

Does Kargo wait for refresh to finish before sending a sync signal?

krancour commented 15 hours ago

I opened an Argocd issue, but i feel like kargo is syncing before refreshing when i look at the application avents

Does Kargo wait for refresh to finish before sending a sync signal?

We patch the Application with a sync operation. When Argo CD syncs, it is smart about not relying on its cache. So the short answer is no. The longer answer is, "no because there's no need."

After the sync is initiated, Kargo follows up on it as many times as necessary to find evidence that the exact sync operation it initiated has since completed.
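
Concretely, "patch the Application with a sync operation" refers to setting the Application's `operation` field, Argo CD's standard mechanism for requesting a sync. The exact fields Kargo sets may differ from this sketch:

```yaml
# Sketch of a sync request written into an Application's `operation` field.
# Field names are standard Argo CD; the values here are placeholders.
operation:
  initiatedBy:
    username: kargo
  sync:
    revision: <desired-commit-sha>
```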

andrii-korotkov-verkada commented 14 hours ago

It seems the app starts showing OutOfSync only once a sync is already running, which suggests there may be a race. I don't know the exact reasons, though.