akuity / kargo

Application lifecycle orchestration
https://kargo.akuity.io/
Apache License 2.0

proposal: FailurePolicy (or something of the sort) #2972

Open krancour opened 1 day ago

krancour commented 1 day ago

From https://github.com/akuity/kargo/issues/2968#issuecomment-2489188432

The exact conditions that precipitated this proposal were many Stages whose Promotion processes all push to the same branch. Unsurprisingly, this can create races between concurrent Promotions. In the time between one Promotion checking out the relevant branch and pushing a new commit to it, another Promotion may have pushed its own commit to that branch, thereby creating a conflict that causes the first Promotion's git-push step to fail.

This is one of many reasons I strongly advocate using a dedicated branch per Stage as a sort of storage, but this issue isn't about the wisdom or folly of any particular approach. The scenario above is merely an accessible example of a Promotion failure that could be resolved simply by re-running the Promotion process from its first step (step 0).
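To make the scenario concrete, here is a rough sketch of the race and of a "restart from step 0" recovery. It is Python for brevity and is not Kargo code: `Branch`, `StepFailed`, `run_with_restart`, and the three step functions are all hypothetical names invented for illustration, and the "branch" is just an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


class StepFailed(Exception):
    """Raised by a (hypothetical) step that could not complete."""


@dataclass
class Branch:
    """Toy stand-in for a remote Git branch: an ordered list of commit IDs."""
    commits: List[str] = field(default_factory=list)

    def head(self) -> Optional[str]:
        return self.commits[-1] if self.commits else None

    def push(self, expected_head: Optional[str], commit: str) -> None:
        # Reject the push if the branch moved since it was checked out.
        if self.head() != expected_head:
            raise StepFailed("push rejected: branch has new commits")
        self.commits.append(commit)


def run_with_restart(steps: List[Callable[[Dict], None]], max_attempts: int = 3) -> int:
    """Run all steps in order; on any failure, start over from step 0.

    Returns the number of attempts used. Re-raises once attempts run out.
    """
    for attempt in range(1, max_attempts + 1):
        ctx: Dict = {}  # fresh state for every attempt
        try:
            for step in steps:
                step(ctx)
            return attempt
        except StepFailed:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")


branch = Branch(["a"])
rival_pushed: List[bool] = []


def checkout(ctx: Dict) -> None:
    ctx["base"] = branch.head()


def render(ctx: Dict) -> None:
    # Simulate another Promotion pushing once, between our checkout and push.
    if not rival_pushed:
        branch.push(branch.head(), "rival")
        rival_pushed.append(True)
    ctx["commit"] = "mine"


def git_push(ctx: Dict) -> None:
    branch.push(ctx["base"], ctx["commit"])


attempts = run_with_restart([checkout, render, git_push])
```

In this toy run the first attempt fails at the push because the rival commit landed first, and the second attempt succeeds because the checkout is repeated against the updated branch. Retrying only the push step would not have helped, which is why the restart has to begin at the first step.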

With Promotion processes being entirely user-defined, it's not really possible to build any intelligent recovery logic directly into the git-push step. It seems, however, that there is a range of simple and generic "FailurePolicies" that could be quite useful.

Some ideas for further discussion:

Users could select a policy from these options and we can add more options over time.

Another complementary idea is for individual steps to be able to provide a hint in a failure result as to how best to proceed.
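As a conversation starter only, here is one way a user-selected policy and a step-provided hint might interact. Everything here is an assumption: the `Action` values are placeholders I made up for the sketch (not a proposed option list), `StepResult` and `resolve_action` are hypothetical, and it is Python rather than Go purely for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    """Placeholder recovery actions, invented for this sketch."""
    ABORT = "abort"
    RETRY_STEP = "retry-step"
    RESTART = "restart"


@dataclass
class StepResult:
    ok: bool
    hint: Optional[Action] = None  # a failing step's optional suggestion


def resolve_action(result: StepResult, policy: Action) -> Optional[Action]:
    """Decide what to do after a step.

    Returns None when the step succeeded. On failure, the step's hint is
    used if it gave one; otherwise the user-selected policy applies.
    """
    if result.ok:
        return None
    return result.hint if result.hint is not None else policy
```

In this sketch the hint takes precedence over the user's policy. The opposite precedence is just as easy to implement, and which one is right seems like part of what needs discussing.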

We've heard many users ask for automatic rollbacks before, though there is no issue tracking that request yet. I would propose that this notion of FailurePolicies might be the correct angle from which to approach it.

cc @jessesuen and @hiddeco for input.