akuity / kargo

Application lifecycle orchestration
https://kargo.akuity.io/
Apache License 2.0

Stage incorrectly marked as unhealthy when application is suspended #2216

Open billyshambrook opened 3 days ago

billyshambrook commented 3 days ago

Description

When the application uses a Rollout with a pause step, the Argo CD Application's health status is Suspended. Kargo incorrectly shows the Stage as Unhealthy.

Version

kargo version
Client Version: v0.7.1
Server Version: v0.7.1

krancour commented 2 days ago

I would want @hiddeco to weigh in on this, as this seems to me to be a gray area.

While Application health influences Stage health, there is no direct equivalence between the two. To illustrate, note that some Stages may involve multiple Applications, in which case a Stage can be no healthier than the least healthy App. Also, because Stages host specific Freight that references specific versions of artifacts, Stages have some notion of what revision each App should be synced to, and deviation from those expectations is a health problem. All of this is to say that compared with App health, Stage health reflects a much more comprehensive analysis of what is going on.

On to the question of what should a Stage's health actually be when an associated App is Suspended...

I see a lot of different possibilities and various consequences of each:

  1. App is Suspended, so Stage is Suspended.

    This is a health state that Stages don't currently have. It would be new.

    The question I ask myself here is to what extent this is useful/meaningful information at the Stage level. If all Apps associated with the Stage are healthy, apart from one or more of them being Suspended, what does that mean to Kargo? Doesn't it mean the Stage is still attempting to reach a healthy state? And how does that differ from Progressing?

    Which leads directly to the next option...

  2. App is Suspended, so Stage is Progressing

    Since I understand that Stage health and App health are different things, this makes the most sense to me, but I realize that this is probably going to confuse most users.

  3. App is Suspended, so Stage is Healthy

    This is true from a certain perspective. Nothing is actually wrong. The obvious problem is that when a Stage becomes Healthy for the first time following a Promotion, any associated verification processes are triggered, but they really ought not be until associated Applications are actually Healthy.

    So this is probably a bad option.

  4. App is Suspended, so Stage is Unhealthy

    This is where we're at today and this behavior was clearly surprising -- thus why we're here having this conversation.

    This is true from a certain perspective. The Stage has not reached a stable state and it would be premature to move on to executing any verification processes. So if it's not Healthy and not Unhealthy, it's Progressing, right? (That's option 2, which I conceded will likely confuse people.)

I apologize for my long-windedness here. I don't think there's any obvious answer to this conundrum, which is why I'd like @hiddeco's take on it -- maybe @jessesuen's as well.

billyshambrook commented 2 days ago

Yep, that's a tricky one. I immediately thought it should be in the "progressing" state, as something is still in progress. That makes sense, but...

...the in-progress deployment could require some user action if, for example, it's a rollout pause step awaiting manual approval. Here, introducing a suspended status starts to make sense as there is something for a user to take action on. Thinking about it, it feels like Argo CD should actually show "progressing" unless the pause step needs user action; otherwise, everything is "progressing" as expected, at least at the app/stage level.

krancour commented 2 days ago

@billyshambrook you raise a good point. I'd actually forgotten that a pause step in a Rollout could be indefinite pending user action.

I think you're on the right track calling out that "suspended" and "suspended indefinitely pending user action" are two very different things. It would be nice if Kargo could surface cases of the latter while treating cases of the former like business-as-usual: Progressing. (Though I still think a Progressing Stage with a Suspended App is bound to confuse some people. But we can come back to this later.) Realistically, Kargo can only distinguish between the two if the GitOps agent (Argo CD in this case, but theoretically others in the future) can surface that detail.

We'll start with some digging into whether an Argo CD App status has sufficient information for Kargo to make such a determination and we'll go from there.

Back on the subject of potential confusion if a Stage is Progressing while an App is Suspended (but not indefinitely)... maybe we can clear that up (and also drive home the reality of Stage health and App health being related, but not equivalent) by choosing a word other than "Progressing" to convey that a Stage is waiting for underlying things to reach a Healthy state.