akuity / kargo

Application lifecycle orchestration
https://kargo.akuity.io/
Apache License 2.0
1.56k stars 137 forks source link

bug: if a step-based promotion fails, the stage get stuck in the promoting phase until a new promotion succeeds #2693

Open krancour opened 3 days ago

krancour commented 3 days ago

This doesn't seem like the right thing to do.

After a Promotion fails or errors, a Stage may be between states -- neither the beginning state nor the target state. (Whether this is true or not would really depend on the particulars of the user-defined promotion process.) imho, this calls for us to move the Stage out of the Promoting phase and into some new phase like "Indeterminate."

@hiddeco any thoughts here?

P.S. Let's not have this issue snowball into one about automated rollbacks. Assuming we will do that at some point in the future, recognizing and categorizing these conditions as called for by this issue is certainly a prerequisite.

hiddeco commented 3 days ago

Is the Stage actually still reporting it is promoting (which would be surprising to me, as I think we look at the terminated state of the last promotion)?

Or is this more about the Stage saying it's on Freight X, while it's actually "partially" on Freight Y?

krancour commented 3 days ago

It still says phase is Promoting, which isn't deliberate. But in terms of fixing it, it seems wrong to put it to Verifying or Steady.

hiddeco commented 3 days ago

Based on the code that synchronizes the Promotions for a Stage, I would expect the Stage to be in a Steady phase as soon as the Promotion reaches a terminal state. 🤔

krancour commented 3 days ago

So there are two things here.

  1. I agree with your interpretation of what the code says should be happening and I want to figure out how I've observed something different.

  2. I'm questioning whether what we think the code says should be happening is ideal to begin with. If a Promotion failed, depending on the particulars of the process, "steady" might be a poor descriptor for the currrent state.

Don't worry about this more until I can reproduce it and give some more details about how it happened.

krancour commented 8 hours ago

After a few days, I have been unable to reproduce this, but I maintain that we need to make some kind of Stage status change after a failed Promo. Steady + Healthy doesn't seem appropriate. Depending on the particulars of the user-defined promotion process, the Stage may be in an invalid state and we don't really have a way to know.

hiddeco commented 2 hours ago

I have been unable to reproduce this, but I maintain that we need to make some kind of Stage status change after a failed Promo.

I think the answer to this would be to add Conditions to the Stage, as these are far superior to the current "phase" reflection we do. Every moving element could then get a condition type, allowing you to provide a much better insight into what particular things failed.