akuity / kargo

Application lifecycle orchestration
https://kargo.akuity.io/
Apache License 2.0

promotion process unnecessarily clones repos repeatedly -- can encounter rate limit #2178

Open krancour opened 1 week ago

krancour commented 1 week ago

I'll stop short of calling this a bug, because I know it's by design, but it still presents a bit of a problem...

Consider a Stage employing the following promotion mechanisms:

  promotionMechanisms:
    gitRepoUpdates:
    - repoURL: https://github.com/example/repo.git
      kustomize:
        images:
        - image: nginx
          path: stages/test
    argoCDAppUpdates:
    - appName: kargo-demo-test

When a Promotion is in a Running phase, it is periodically re-reconciled to follow up on the state of the Application. In the course of those repeated reconciliations, the gitRepoUpdates are performed anew each time. They're idempotent, so this is ok, but not very efficient.

I know this to be by design, because it means an abruptly terminated reconciliation attempt (due to controller restart, perhaps) can recover.

The inefficiency of this becomes a bigger problem in the context of #2156 (for instance). When a repo is cloned several times in rapid succession using SSH, it seems we run the risk of bumping into what appears to be a rate limit on SSH connections to GitHub. The result is that a first reconciliation attempt generally succeeds, all the way through to forcing the Argo CD App to refresh, meaning the promotion is, largely speaking, complete, but it will likely fail on a subsequent reconciliation attempt.

I'm trying to think of a clean way to improve this...

@hiddeco what would you think about replacing the Running phase with two new phases -- Mutating and Waiting?

On successful completion of Git or Argo CD updates we could transition from Mutating to Waiting. On subsequent reconciliations, parts of the promotion process could be skipped if the Promotion is found to be in a Waiting state.
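A minimal sketch of what that gating could look like in the reconciler. The phase names and the Promotion struct here are hypothetical stand-ins, not Kargo's actual API types:

```go
package main

import "fmt"

// Hypothetical phase values; Kargo's actual API has a single Running phase today.
type PromotionPhase string

const (
	PhaseMutating PromotionPhase = "Mutating"
	PhaseWaiting  PromotionPhase = "Waiting"
)

type Promotion struct {
	Phase PromotionPhase
}

// reconcile skips the mutation steps entirely when a prior attempt already
// completed them and left the Promotion in the Waiting phase.
func reconcile(p *Promotion) (mutated bool) {
	if p.Phase == PhaseMutating {
		// Clone, render, and push here (omitted), then hand off to waiting.
		p.Phase = PhaseWaiting
		mutated = true
	}
	// In either case, follow up on the Argo CD Application's sync state here.
	return mutated
}

func main() {
	p := &Promotion{Phase: PhaseMutating}
	fmt.Println(reconcile(p)) // first attempt performs the git updates: true
	fmt.Println(reconcile(p)) // subsequent attempts skip them: false
}
```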

Just one idea -- I am open to others.

krancour commented 1 week ago

cc @gnadaban

hiddeco commented 1 week ago

I wonder if another idea could be to cache the cloned Git repository until the Promotion has reached its end phase?
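Roughly, something like a per-Promotion clone cache evicted on the terminal phase. All names here are a hypothetical sketch, not Kargo's actual internals:

```go
package main

import "fmt"

// cloneCache maps a Promotion name to a local working-tree path, so repeated
// reconciliations of the same Promotion reuse a single clone.
type cloneCache struct {
	dirs map[string]string
}

func newCloneCache() *cloneCache {
	return &cloneCache{dirs: map[string]string{}}
}

// get returns the cached working tree for a Promotion, cloning only on a miss.
func (c *cloneCache) get(promotion, repoURL string) (dir string, cloned bool) {
	if d, ok := c.dirs[promotion]; ok {
		return d, false
	}
	d := "/tmp/kargo-clones/" + promotion // real code would git-clone repoURL here
	c.dirs[promotion] = d
	return d, true
}

// evict drops the clone once the Promotion reaches a terminal phase.
func (c *cloneCache) evict(promotion string) {
	delete(c.dirs, promotion)
}

func main() {
	cache := newCloneCache()
	_, cloned := cache.get("demo-promo", "https://github.com/example/repo.git")
	fmt.Println(cloned) // true: the first reconciliation clones
	_, cloned = cache.get("demo-promo", "https://github.com/example/repo.git")
	fmt.Println(cloned) // false: later reconciliations reuse the clone
	cache.evict("demo-promo")
}
```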

gnadaban commented 1 week ago

Caching the repo would indeed be super useful, as would being able to specify a PersistentVolume to clone the repositories to. Orgs with large repos might want to keep the cache between Pod restarts.

krancour commented 1 week ago

I don't think caching the repo between reconciliation attempts completely solves this problem.

I misspoke when I said this:

In the course of those repeated reconciliations, the gitRepoUpdates are performed anew each time. They're idempotent, so this is ok, but not very efficient.

A not-completely-idempotent update is possible. @gnadaban has already encountered this. Suppose you are rendering some Helm chart whose output is not completely deterministic -- perhaps it uses timestamps somewhere, generates a random string, or uses a bcrypt hash. In such a case, repeated reconciliation attempts will keep pushing new changes to the repo.

So caching may make repeating certain steps of a promotion more efficient, but it doesn't address that we ought to avoid repeating certain steps in the first place.

gnadaban commented 1 week ago

Ideally the Argo CD promotion would only post an update if the currently deployed revision is not the desired one. This check could be done early in the reconciliation loop, which could then exit without re-cloning the repo.
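In other words, a hypothetical early-exit check along these lines, where both revision values would come from the Application's status and the Freight:

```go
package main

import "fmt"

// needsGitUpdate reports whether the repo must be cloned and updated at all.
// If the Application is already synced to the desired revision, the
// reconciliation can exit early without cloning.
func needsGitUpdate(deployedRevision, desiredRevision string) bool {
	return deployedRevision != desiredRevision
}

func main() {
	fmt.Println(needsGitUpdate("abc123", "def456")) // true: clone and update
	fmt.Println(needsGitUpdate("def456", "def456")) // false: exit early, no clone
}
```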

Would that work?

krancour commented 1 week ago

There are circumstances besides waiting for an Argo CD App to sync where a promotion may be long-running and require many (possibly very many) reconciliations to complete. I expect more things of this nature in the relatively near future. One would involve following up on the status of a PR. Waiting for an Argo CD App was just, coincidentally, the thing that brought this issue to light.

krancour commented 1 week ago

I'm thinking about other things we know are on the horizon and if we were to split Running into Mutating (or Running) and Waiting phases, I think we'd find pretty quickly that there are cases where we don't transition from Running to Waiting, but actually transition in the other direction. Consider a promotion that requires a manual approval -- something we know we'll do eventually. We might go from Waiting for approval, to Running the promotion, and back to Waiting, this time waiting for an Argo CD App to sync.

I think phases like Running and Waiting, especially coupled with the existing message field (currently used only for failures and errors) would be useful for users to gain insight into what a Promotion is doing and how far along it is, but I think we might benefit from something more fine-grained that informs the reconciler of where the previous reconciliation attempt left off.

We absolutely need to do this, because, returning to the manual approval example, replaying a manual approval step would be far more problematic than the original issue of re-cloning and re-applying config management tools.

@hiddeco you and I also had some recent, offline discussion related to #1680 and promotions evolving in a direction where the steps are more similar to directives in a Dockerfile. e.g. Clone this, copy that, run Kustomize, push the result.

We could do something along the lines of patching a history field (compare to shell history) as each directive succeeds. Failure to patch the history could still result in a step being repeated in the next reconciliation attempt, but in the average case, if a Promotion goes to sleep waiting for something and is subsequently re-reconciled, there will be a good paper trail showing how far along it is in the process and no completed steps will repeat.
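As a rough sketch of that resume behavior (all names hypothetical, and "patching status" reduced to appending to a slice), completed directives recorded in the history are skipped on re-entry:

```go
package main

import "fmt"

// step is one directive in a promotion process, e.g. clone, kustomize, push.
type step struct {
	name string
	run  func() error
}

// runSteps executes steps in order, appending each completed step's name to
// history; on re-entry it skips anything the history already records, so an
// interrupted Promotion resumes where the prior attempt left off.
func runSteps(steps []step, history []string) ([]string, error) {
	done := map[string]bool{}
	for _, name := range history {
		done[name] = true
	}
	for _, s := range steps {
		if done[s.name] {
			continue // completed in a prior reconciliation attempt
		}
		if err := s.run(); err != nil {
			return history, err // resume from this step next time
		}
		history = append(history, s.name) // real code would patch status here
	}
	return history, nil
}

func main() {
	var ran []string
	steps := []step{
		{"clone", func() error { ran = append(ran, "clone"); return nil }},
		{"kustomize", func() error { ran = append(ran, "kustomize"); return nil }},
		{"push", func() error { ran = append(ran, "push"); return nil }},
	}
	history, _ := runSteps(steps, nil)
	history, _ = runSteps(steps, history) // second attempt replays nothing
	fmt.Println(ran)     // [clone kustomize push]
	fmt.Println(history) // [clone kustomize push]
}
```

A failed patch of the history would still let a step repeat once, but in the average case a re-reconciled Promotion picks up exactly where it left off.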

We probably need to tackle this in tandem with #1608/#2154.