The exact conditions that precipitated this proposal were many Stages whose Promotion processes all attempt to push to the same branch. Unsurprisingly, this can create races between concurrent Promotions. In the time between one Promotion checking out the relevant branch and pushing a new commit to it, another Promotion may have pushed its own commit to that branch, thereby creating a conflict that causes the first Promotion's git-push step to fail.
This is one of many reasons I strongly advocate using a dedicated branch per Stage as a sort of storage, but this issue isn't about the wisdom or folly of any particular approach. The scenario above is merely an accessible example of a Promotion failure that could be resolved simply by repeating the steps of the Promotion process again, starting from step 0.
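The race described above can be illustrated with a toy model. This is only a sketch: `Branch` is a hypothetical stand-in for a remote branch that rejects any push not based on its current head, and it does not call Git or Kargo.

```python
class Branch:
    """Toy stand-in for a remote branch: a push must build on the current head."""

    def __init__(self) -> None:
        self.head = 0

    def push(self, based_on: int) -> bool:
        if based_on != self.head:
            return False  # rejected, like a non-fast-forward push
        self.head += 1
        return True


branch = Branch()
seen_by_a = branch.head                # Promotion A checks out the branch
seen_by_b = branch.head                # Promotion B checks out the same commit
b_ok = branch.push(seen_by_b)          # B pushes first and succeeds
a_ok = branch.push(seen_by_a)          # A's push is rejected: the head has moved
a_retry_ok = branch.push(branch.head)  # A starts over from the new head
```

In this model, A's second attempt succeeds only because it re-reads the head first, which is the same reason restarting a Promotion from step 0 can resolve the failure.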
With Promotion processes being entirely user-defined, it's not really possible to build any intelligent recovery logic directly into the git-push step. It seems, however, that there is a range of simple and generic "FailurePolicies" that could be quite useful.
Some ideas for further discussion:
Start the Promotion again from step 0 (retry up to some limit)
Let the Promotion fail then automatically create a new one just like it (retry up to some limit)
Do nothing
Let the Promotion fail, then automatically create a new one to return the Stage to its previous state (retry up to some limit)
Other...
Users could select a policy from these options and we can add more options over time.
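To make the first option concrete, here is a minimal sketch of how a controller might apply a selected policy. Everything in it is an assumption for illustration: the `FailurePolicy` members, `StepFailed`, `run_promotion`, and `max_retries` are hypothetical names, not part of Kargo's API, and only the "start again from step 0" behavior is implemented.

```python
from enum import Enum
from typing import Callable, List


class FailurePolicy(Enum):
    # Hypothetical names mirroring the ideas listed above.
    RESTART = "restart"    # start the Promotion again from step 0
    RECREATE = "recreate"  # fail, then create an identical Promotion
    NONE = "none"          # do nothing
    ROLLBACK = "rollback"  # fail, then return the Stage to its previous state


class StepFailed(Exception):
    """Raised by a step that cannot complete (e.g. a rejected push)."""


def run_promotion(steps: List[Callable[[], None]],
                  policy: FailurePolicy,
                  max_retries: int = 3) -> int:
    """Run all steps in order and return the number of attempts used.

    Only RESTART is handled here; under any other policy the failure
    is simply re-raised to the caller.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            for step in steps:
                step()
            return attempts
        except StepFailed:
            # Give up once the initial attempt plus max_retries retries are spent.
            if policy is not FailurePolicy.RESTART or attempts > max_retries:
                raise
```

With a step list whose push fails once and then succeeds, `run_promotion(steps, FailurePolicy.RESTART)` would complete on the second attempt, while `FailurePolicy.NONE` would surface the first failure unchanged.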
Another complementary idea is for individual steps to be able to provide a hint in a failure result as to how best to proceed.
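One possible shape for such a hint, sketched under the assumption that a failure result carries an optional `retryable` flag. The `StepFailure` type and `should_retry` function are hypothetical and shown only to illustrate how a hint might be combined with the user-selected policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepFailure:
    step: str
    message: str
    # Hypothetical hint: True if repeating may help, False if it will not,
    # None if the step has no opinion.
    retryable: Optional[bool] = None


def should_retry(failure: StepFailure, policy_allows_retry: bool) -> bool:
    """Combine the user-selected policy with the step's hint."""
    if not policy_allows_retry:
        return False
    # An explicit "not retryable" hint vetoes the retry;
    # a missing hint defers to the policy.
    return failure.retryable is not False
```

In this design the hint can only narrow what the policy permits, so a step can never force a retry the user did not opt into.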
We've heard many ask for automatic rollbacks before, though no issue exists for it yet. I would propose that this notion of FailurePolicies might be the correct angle from which to approach that.
From https://github.com/akuity/kargo/issues/2968#issuecomment-2489188432
cc @jessesuen and @hiddeco for input.