warehouse status should have "health" and "phase" fields

krancour commented 1 week ago

WarehouseStatus has a Message field that describes any errors encountered in the course of a reconciliation, but this alone might not be adequate for surfacing what's currently going on with a Warehouse. I have to dig a little too deep to get at this information.

When I run kubectl get warehouses or kargo get warehouses I'd like to see new columns that show with a glance:

Did the last reconciliation succeed or fail?
Is it currently discovering artifacts?

Ever since we move to discovering n recent artifacts instead of only selecting the latest applicable one, a single reconciliation loop can take a bit longer than it used to -- possibly even a couple minutes depending on what's going on in the rest of the system since we rate limit connections on a per-registry basis.

I want to know if my Warehouse is idling or currently doing something.

I propose two possible health states -- Healthy and Unhealthy. If Unhealthy, the message field will contain an error. (That part already works.)

I propose two possible phases -- Discovering and Idling.

This would require very little effort and I believe it would improve the UX significantly for CLI / kubectl users.

@jessesuen @hiddeco any thoughts on this?

hiddeco commented 3 days ago

The way I have dealt with this in the past is using conditions, which are kind of the successors of "phases" in Kubernetes API designs based on:

Some resources in the v1 API contain fields called phase, and associated message, reason, and other status fields. The pattern of using phase is deprecated. — https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md

A Reconciling condition would then be added when the Kubernetes object is "in motion", and removed once it reaches a stable state.

This makes me wonder if it would not be more practical for Kargo to move to such a pattern, as it allows you to describe one (or more) indication(s) about the resource state.

krancour commented 3 days ago

@hiddeco I'd be open to that if we do it across the board for consistency. If we were to do so, how do you propose we distill the conditions down to something suitably phase-like for easy human consumption?

When I do kargo get <resource type>, I want to be easily able to easily discern the current state from the information in the returned columns.

hiddeco commented 18 minutes ago

Taking Warehouse as an example, this could look something like the following using the metav1.Condition type. Which can easily be disabled as columns using e.g.

// +kubebuilder:printcolumn:name="Healthy",type="string",JSONPath=".status.conditions[?(@.type==\"Healthy\")].status",description=""
// +kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type==\"Ready\")].status",description=""

Discovering artifacts and healthy

status:
  conditions:
    - type: Reconciling
      status: True
      reason: ScheduledDiscovery
      message: "Currently discovering artifacts for 3 sources."
      lastTransitionTime: "2024-07-01T12:00:00Z"
    - type: Ready
      status: False
      reason: ScheduledDiscovery
      message: "Currently discovering artifacts for 3 sources."
      lastTransitionTime: "2024-07-01T12:00:00Z"
    - type: Healthy
      status: True
      reason: AllSourcesAccessible
      message: "All 3 sources are accessible."
      lastTransitionTime: "2024-07-01T12:00:00Z"

Idling and healthy

status:
  conditions:
    - type: Ready
      status: True
      reason: ArtifactsDiscovered
      message: "Discovered 86 artifacts from 3 sources."
      lastTransitionTime: "2024-07-01T12:00:00Z"
    - type: Healthy
      status: True
      reason: AllSourcesAccessible
      message: "All 3 sources are accessible."
      lastTransitionTime: "2024-07-01T12:00:00Z"

Discovering artifacts and unhealthy

status:
  conditions:
    - type: Reconciling
      status: True
      reason: ScheduledDiscovery
      message: "Currently discovering artifacts for 3 sources."
      lastTransitionTime: "2024-07-01T12:00:00Z"
    - type: Ready
      status: False
      reason: SourceInaccessible
      message: "One or more sources are inaccessible."
      lastTransitionTime: "2024-07-01T12:00:00Z"
    - type: Healthy
      status: False
      reason: SourceInaccessible
      message: "One or more sources are inaccessible."
      lastTransitionTime: "2024-07-01T12:00:00Z"

For other Kargo API kinds, we would have to translate the existing phases to condition types like the above. The goal here would be to share certain condition types as much as we can across kinds (e.g. Ready, Healthy, and Reconciling) to describe the "high level" state, which eases consumption over having to learn the specific condition types for each kind.

However, specific ("lower level") condition types would still be allowed to exist, and can often be summarized, composed and/or merged into these "higher level" types.

akuity / kargo