PrefectHQ / prefect-operator

A Kubernetes operator for managing Prefect servers and work pools
10 stars 0 forks source link

Update conditions more conservatively when resources are in the desired state #39

Closed chrisguidry closed 2 months ago

chrisguidry commented 2 months ago

Before, all of the final else blocks of reconciling were updating the condition to say effectively "still all good here!", which was overwriting whatever the prior iterations had updated even though there was no materially new information.

In this change, I'm putting guards around those final updates so that we only update them if

mitchnielsen commented 2 months ago

Tested by deploying per the readme and watching the conditions on each CRD:

prefectserver

$ kubectl get prefectserver prefect-sqlite -o jsonpath='{.status.conditions}'| jq
[
  {
    "lastTransitionTime": "2024-08-20T21:42:12Z",
    "message": "PersistentVolumeClaim was created",
    "reason": "PersistentVolumeClaimCreated",
    "status": "True",
    "type": "PersistentVolumeClaimReconciled"
  },
  {
    "lastTransitionTime": "2024-08-20T21:42:12Z",
    "message": "MigrationJob is not required",
    "reason": "MigrationJobNotRequired",
    "status": "True",
    "type": "MigrationJobReconciled"
  },
  {
    "lastTransitionTime": "2024-08-20T21:42:12Z",
    "message": "Deployment was created",
    "reason": "DeploymentCreated",
    "status": "True",
    "type": "DeploymentReconciled"
  },
  {
    "lastTransitionTime": "2024-08-20T21:42:12Z",
    "message": "Service was created",
    "reason": "ServiceCreated",
    "status": "True",
    "type": "ServiceReconciled"
  }
]

Able to access server (by port-forwarding the Service):

image

prefectworkpool

These Pods are failing to come up:

httpcore.ConnectError: [Errno -2] Name or service not known

However the status condition on the CR makes it appear healthy:

$ kubectl get prefectworkpool sample-pool -o jsonpath='{.status.conditions}'| jq
[
  {
    "lastTransitionTime": "2024-08-20T21:42:12Z",
    "message": "Deployment is in the correct state",
    "reason": "DeploymentUpdated",
    "status": "True",
    "type": "DeploymentReconciled"
  }
]
$ kubectl get prefectworkpool sample-pool
NAME          TYPE      VERSION     DESIRED WORKERS   READY WORKERS
sample-pool   process   3.0.0rc18   3                 0

We may just eventually add a status condition reporting that the desired number of workers has not yet been satisfied, but that'll be out of the scope of this PR.

chrisguidry commented 2 months ago

Yeah it's still way more verbose than I'd like. Using SetStatusCondition takes care of a lot of it, but it's still just a verbose thing (with type, status, reason, and message every time). One thing that might make it cleaner is to use a function like you saw in cert-manager, but maybe one for each status (true or false):

func observeCondition(ctx context.Context, workspace *cloudv1.Workspace, type, reason, message string) error {
  ...
}
func observeAbsence(ctx context.Context, workspace *cloudv1.Workspace, type, reason, message string) error {
  ...
}

so then when you observeCondition, it's status "True", and when you observeAbsence it's "False"? That might help with some of the eyebleed.

Also, maybe if we get errors setting the status, we just log those failures instead of returning them back and stopping the process? That makes them a little more "fire and forget". I don't know what kinds of errors would even happen there besides a total failure to talk to the API

chrisguidry commented 2 months ago

I'll take a stab at that in a future PR

mitchnielsen commented 2 months ago

One thing that might make it cleaner is to use a function like you saw in cert-manager, but maybe one for each status (true or false):

Sounds like a good improvement so they're ~ 1 liners instead 👍🏼

Also, maybe if we get errors setting the status, we just log those failures instead of returning them back and stopping the process? That makes them a little more "fire and forget". I don't know what kinds of errors would even happen there besides a total failure to talk to the API

Fair question, looking through some examples I see a pattern of defining some updateStatus function and then defer-ing it (example from Cert Manager). So the status and condition variables are defined early with empty or default values, then the values are updated throughout the method based on activities, and then defer ensures that the status is updated before that method exits. Might be a good way to reduce the API calls, repetition, and if/else instances 🤔

chrisguidry commented 2 months ago

Oh that's amazing, I forgot about defer in Go, it's one of the raddest things they've done