akuity / kargo

Application lifecycle orchestration
https://kargo.akuity.io/
Apache License 2.0
1.55k stars 133 forks source link

feat(controller): SSH support for Git repositories in Warehouses #2156

Closed gnadaban closed 3 months ago

gnadaban commented 3 months ago

Changes in PR:

netlify[bot] commented 3 months ago

Deploy Preview for docs-kargo-akuity-io ready!

Name Link
Latest commit b8c8cd467a03dcb1c0ef3b1096a11646814dea77
Latest deploy log https://app.netlify.com/sites/docs-kargo-akuity-io/deploys/6675c3b973fd3c0008847aad
Deploy Preview https://deploy-preview-2156.kargo.akuity.io
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.

Project coverage is 46.30%. Comparing base (cf87a7b) to head (b8c8cd4). Report is 6 commits behind head on main.

Files Patch % Lines
internal/controller/git/git.go 14.28% 6 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2156 +/- ## ========================================== - Coverage 46.32% 46.30% -0.02% ========================================== Files 242 242 Lines 16754 16767 +13 ========================================== + Hits 7761 7764 +3 - Misses 8623 8632 +9 - Partials 370 371 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

gnadaban commented 3 months ago

Can I please get a review on this? :)

gnadaban commented 3 months ago

@hiddeco is there anything else to do other than updating the API definition comments? I assume generation will be done during release?

krancour commented 3 months ago

What I am now running into is that promotions that push anything to a git repo are failing. It took me a while to notice what was actually going on...

While a promotion is running, it is re-reconciled periodically, usually to check the status of something external, like a pull request or an Argo CD Application...

It seems all parts of the promotion process re-execute on each reconciliation. Arguably, this was by design. As long as the promotion process is idempotent, then running through all the steps each time ensures that if a previous reconciliation attempt had abruptly failed (due to controller restart, for instance) everything will work itself out.

Unfortunately, this also means the the repo gets cloned many times in fairly rapid succession and GitHub, at least, seems to be rate-limiting ssh connections. So what's actually going on is all the steps of the promotion are succeeding, all the way up to forcing a sync/refresh of an Argo CD Application. On subsequent reconciliation attempts (to follow-up on the status of the Argo CD App) we hit the rate limit when trying to clone, and the whole promotion then errors out, despite being, laregly, complete.

I'm going to open a separate issue to see how we can improve this behavior, but I might want to delay merging this until that is resolved because anyone trying ssh + GitHub is likely to run into this.

gnadaban commented 3 months ago

It seems all parts of the promotion process re-execute on each reconciliation. Arguably, this was by design. As long as the promotion process is idempotent, then running through all the steps each time ensures that if a previous reconciliation attempt had abruptly failed (due to controller restart, for instance) everything will work itself out.

I found this out the hard way as well, and was something I wanted to bring up but I'm not sure if opening an Issue in Github is the right forum. This was causing me grief in the form of constantly changing a template-time generated credential, which I was only able to get around by switching my testing to a different app that doesn't have one. I'd say some optimization in this area is very much warranted, if for nothing else then due to the cross-regional network cost implications of the traffic it causes.

I'm going to open a separate issue to see how we can improve this behavior, but I might want to delay merging this until that is resolved because anyone trying ssh + GitHub is likely to run into this.

Please CC me on that issue if you don't mind, I'd like to stay in the loop as it's an important feature we'd benefit from as well.

That said, can I please ask that at least this fix is not held back for a tangential reason? I appreciate your concern, but this fix is solid and would unblock more people than would hinder IMO. The method of cloning from Git does not change the fact of the repeated sync if others are already using the HTTPS auth method, right?

krancour commented 3 months ago

@gnadaban https://github.com/akuity/kargo/issues/2178

krancour commented 3 months ago

I've just come to realize that promotions involving a pull request aren't going to work with ssh creds.

krancour commented 3 months ago

I've just come to realize that promotions involving a pull request aren't going to work with ssh creds.

Thinking about this more, if one really wanted to do ssh + PRs:

  1. Kargo would need the ability to use separate creds for interacting with the repo and for interacting with the git provider's API.

  2. Users would be required to handle two credentials for updates to one repo.

Since I cannot see the sense in no.2, I cannot see the wisdom of bothering with no.1.

The simplest thing may be to say ssh and pull requests won't work together.

I believe this PR should amend the validating webhook for Stages to return an error in the event that any gitRepoUpdate is configured with an ssh URL and wants to use a pull request.

gnadaban commented 3 months ago

I'm not sure I understand why SSH couldn't work with gitRepoUpdate, but I'm willing to look into it, and address in a follow-up PR.

The changes included in this PR so far however unblock our team, and at least partially fixes SSH support for those that don't use Git promotion, only ArgoCD. Can we merge this in, so I can rebase #2160 on this, and I can open a follow-up PR to look into either making gitRepoUpdate +SSH work, or amend the webhook as you suggest.

My concern is primarily the time delay between each roundtrip and that this causes our team to have to maintain a fork and an unofficial build to be able to proceed with testing Kargo.

krancour commented 3 months ago

I'm not sure I understand why SSH couldn't work with gitRepoUpdate

The most recent issue I mentioned is with pull request-based promotions, specifically. Pull requests aren't a git feature. It's a (common) feature of individual platforms like GitHub or GitLab. To open them or follow up on their status involves interacting with those platforms' APIs. Those APIs are usually over HTTPS, so an SSH key isn't a useful credential for authenticating to them.

The more general issue -- not specific to PR-based updates -- is the one I raised in #2160.

The changes included in this PR so far however unblock our team, and at least partially fixes SSH support for those that don't use Git promotion, only ArgoCD.

I appreciate that merging this unblocks you, but especially given that promotions not involving git are not the norm, using this feature will break things for most other users. We're not going to merge something because it was good enough for one user, while opening the door to problems for most others. Who do you think has to deal with the influx of bug reports?

The compromise I would be willing to make here is that if the regex change to stage_types.go is undone (and codegen re-run), it would close the door on git-based promotions using SSH. SSH would work fine in Warehouses and that's it.

This way, at least, the inevitable bug reports will be like, "My warehouse can use an SSH URL, but my promotion mechanism can't," and we'll have to explain that it's not implemented/supported yet. Which is, at least, somewhat easier to deal with than explaining why Kargo let them do something that we knew would fail.

gnadaban commented 3 months ago

First of all, thanks for your swift and detailed reply, I really appreciate both!

The most recent issue I mentioned is with pull request-based promotions, specifically. Pull requests aren't a git feature. It's a (common) feature of individual platforms like GitHub or GitLab. To open them or follow up on their status involves interacting with those platforms' APIs. Those APIs are usually over HTTPS, so an SSH key isn't a useful credential for authenticating to them.

This makes a lot of sense, I didn't look into how git-based promotions work, and just assumed it's simply pushing commits to a repo.

I appreciate that merging this unblocks you, but especially given that promotions not involving git are not the norm, using this feature will break things for most other users. I'm not sure I fully see the scope from my admittedly limited understanding of the project, and what you say makes sense.

Adding SSH-based Warehouse support without git promotion was my original goal with the PR, admittedly not knowing that the PR based promotion is the mainstream for others (it wouldn't be for us, at least not that I can foresee).

We're not going to merge something because it was good enough for one user, while opening the door to problems for most others. Who do you think has to deal with the influx of bug reports?

That's fair. I don't mean to add to your burdens, I'm just trying to add a small improvement. It is not my intention to break things for others in any way though, so I greatly appreciate your compromise suggestion.

The compromise I would be willing to make here is that if the regex change to stage_types.go is undone (and codegen re-run), it would close the door on git-based promotions using SSH. SSH would work fine in Warehouses and that's it.

I will go ahead and make this suggested change, and hope that I can repay you with other PRs for helping see this through. :)

krancour commented 3 months ago

@gnadaban thank you for understanding and also for your patience and all the hard work on this.

This looks good to me now, but I want to give it a final test-drive, which I should be able to get to before EOW.