Kong / gateway-operator

Kubernetes Operator for Kong Gateways
Apache License 2.0

BlueGreen rollouts: Improve `RolledOut` condition #162

Open mlavacca opened 1 year ago

mlavacca commented 1 year ago

Problem Statement

At the time of writing, the rollout condition is structured as follows:

// RolledOut is a condition type indicating whether or
// not, DataPlane's rollout has been successful or not.
// Possible reasons for this condition to be True are:
//
// * "PromotionDone"
//
// Possible reasons for this condition to be False are:
//
// * "AwaitingPromotion"
// * "Failed"
// * "Progressing"
// * "PromotionInProgress"

And Kong/gateway-operator-archive#1000 introduced a new reason:

// DataPlaneConditionReasonRolloutWaitingForChange is a reason which indicates a DataPlane
// is waiting for a change to trigger new version to be made available before promotion.
DataPlaneConditionReasonRolloutWaitingForChange k8sutils.ConditionReason = "WaitingForChange"

This brings the number of reasons for which the Status of the condition can be False to 5. In this scenario, when no rollout is in progress, the RolledOut condition is normally False with the reason WaitingForChange. During the whole rollout and promotion process, the status stays False and only the reason changes. Only at the final stage, when the promotion is completed, does it flip to True, after which it is immediately put back into the WaitingForChange state with status False. This final False->True->False transition is so quick that the user likely won't even notice it, so in practice the RolledOut condition always appears with Status False.

We may want to improve the UX of this condition, and this issue is about discussing and improving it before 1.0.
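To make the problem concrete, here is a minimal sketch of the status the condition carries at each step of one successful rollout. The ordering of the reasons in `lifecycle` is an assumption made for illustration, and `rolledOutStatus` is a hypothetical helper, not code from the operator.

```go
package main

import "fmt"

// ConditionStatus mirrors the True/False values of a Kubernetes condition.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// rolledOutStatus (hypothetical helper) returns the status the RolledOut
// condition carries for a given reason: only PromotionDone maps to True.
func rolledOutStatus(reason string) ConditionStatus {
	if reason == "PromotionDone" {
		return ConditionTrue
	}
	return ConditionFalse
}

// lifecycle is an ASSUMED ordering of reasons during one successful rollout.
// The "Failed" reason is left out on purpose.
var lifecycle = []string{
	"WaitingForChange",
	"Progressing",
	"AwaitingPromotion",
	"PromotionInProgress",
	"PromotionDone",
	"WaitingForChange",
}

func main() {
	for _, reason := range lifecycle {
		fmt.Printf("RolledOut=%s reason=%s\n", rolledOutStatus(reason), reason)
	}
}
```

Only one step in the whole sequence reports True, and it is immediately followed by a False step, which is why a user polling the condition will almost always see False.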

Alternative solutions

Below are some alternative solutions that can be discussed.

Negative polarity condition

Instead of using a positive polarity condition, we could use a negative polarity condition, where False is the good and stable value and True is the transitioning one. The condition should be renamed to something like DuringRollout. Its default status would be False, while the transitioning status would be True with any of the above reasons (except WaitingForChange, which would be represented by the normal False value).
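A rough sketch of how the existing reasons could map onto such a condition follows. The `Condition` struct is a simplified stand-in for the real Kubernetes type, `duringRollout` is a hypothetical helper, and treating PromotionDone as a settled (False) state is an assumption; the proposal above only spells out the WaitingForChange case.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// duringRollout (hypothetical helper) maps a reason of the current RolledOut
// condition onto a negative polarity DuringRollout condition.
func duringRollout(rolledOutReason string) Condition {
	switch rolledOutReason {
	case "Progressing", "AwaitingPromotion", "PromotionInProgress", "Failed":
		// A rollout or promotion is under way (or has failed midway).
		return Condition{Type: "DuringRollout", Status: "True", Reason: rolledOutReason}
	default:
		// WaitingForChange, and by ASSUMPTION PromotionDone: nothing in flight.
		return Condition{Type: "DuringRollout", Status: "False", Reason: rolledOutReason}
	}
}

func main() {
	for _, r := range []string{"WaitingForChange", "Progressing", "AwaitingPromotion", "PromotionInProgress", "PromotionDone"} {
		c := duringRollout(r)
		fmt.Printf("%s=%s reason=%s\n", c.Type, c.Status, c.Reason)
	}
}
```

With this polarity, the stable state and the state reached after a successful promotion both read False, so the quick True blip of the current design disappears.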

Separate Rollout and promotion conditions

Instead of having a single condition covering both rollout and promotion, we could have two different conditions, one for the rollout and one for the promotion.

With this solution, the status under normal circumstances is rollout False with reason NotStarted and promotion False with reason NotNeeded. The rollout condition will have fewer reasons, which improves the UX.
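As an illustration, the idle state of a DataPlane under this alternative could look like the sketch below. The condition type names `Rollout` and `Promotion` and the `idleConditions` helper are assumptions; only the statuses and reasons come from the description above. The second reason is written without a space because Kubernetes condition reasons are CamelCase identifiers.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// idleConditions (hypothetical helper) returns the pair of conditions a
// DataPlane would carry when no rollout is in progress.
// The type names "Rollout" and "Promotion" are ASSUMED for this sketch.
func idleConditions() []Condition {
	return []Condition{
		{Type: "Rollout", Status: "False", Reason: "NotStarted"},
		{Type: "Promotion", Status: "False", Reason: "NotNeeded"},
	}
}

func main() {
	for _, c := range idleConditions() {
		fmt.Printf("%s=%s reason=%s\n", c.Type, c.Status, c.Reason)
	}
}
```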

Separate CRD

Use a separate CRD that contains all the details of a specific Rollout instance. Whenever the user wants to perform a rollout, a new instance of such a resource needs to be created, and its status will contain the history of that specific rollout operation.

pmalek commented 1 year ago

Broken out Kong/gateway-operator#159 to track the research and design for separate CRD.

This one here can track the effort of improving the condition(s).

cc: @mlavacca

czeslavo commented 1 year ago

After looking into Kong/gateway-operator-archive#1000 and the discussion, I think that changing the condition to a negative polarity DuringRollout makes the most sense, so that we don't overcomplicate things right now.

I believe that would be the most useful from the UX perspective - I guess a typical workflow for someone programming against this condition would be to change something in the DataPlane's spec and wait until the condition flips to False again.

In https://github.com/Kong/gateway-operator/issues/1031 we should provide users with a more granular view into the state of particular rollout revisions. It will make sense to have a positive polarity condition there as that will be a final one after the rollout succeeds.

TLDR: +1 from me for the Negative polarity condition alternative.

pmalek commented 1 year ago

Please note that a rollout is something different than a promotion.

A rollout is currently understood as the deployment of a set of resources that are to become ready for promotion.

A promotion, on the other hand, is just the process of updating the object's (currently only DataPlanes') metadata and the appropriate Service selectors.

The current implementation relies, to an extent, on the reasons of the RolledOut condition when transitioning from one "state" to another.

mlavacca commented 4 months ago

@pmalek What do you think about this one? Does it make sense to migrate it to the new repo?

pmalek commented 4 months ago

It certainly does, but it's all a matter of prioritization and of when this is going to be needed.