fluxcd / helm-controller

The GitOps Toolkit Helm reconciler, for declarative Helming
https://fluxcd.io
Apache License 2.0
414 stars 164 forks source link

DependsOn readiness check should only rely on Ready condition #946

Open seaneagan opened 7 months ago

seaneagan commented 7 months ago

It appears .status.observedGeneration does not reliably get updated after successful helm release upgrades. I can work on filing a separate bug for this, need to find time to come up with a minimal re-production, seems possible an issue could have been introduced in https://github.com/fluxcd/helm-controller/pull/825/commits/48cad6838633344b70e50670c791ea3399097dd5

However, when checking dependencies it seems preferable to rely solely on the Ready condition, including both its Status and ObservedGeneration and ignore .Status.ObservedGeneration. Which then in turn has the side effect of avoiding the above mentioned bug.

https://github.com/fluxcd/helm-controller/blob/5e760db4a83ffd384012c1b47e4615cc07d4081d/internal/controller/helmrelease_controller.go#L588

souleb commented 7 months ago

can you give more information on the issue?

The code is good for me. We expected a dependency to be ready only if the current version have been successfully reconciled.

seaneagan commented 7 months ago

The Ready condition's ObservedGeneration is reliably getting updated, but not .Status.ObservedGeneration, which is resulting in "dependency not ready" even though it is ready. I think these two fields are mostly duplicative, but probably too late to remove .Status.ObservedGeneration at this point anyway? But I wonder if they should simply mirror each other?

stefanprodan commented 7 months ago

@seaneagan I don't see how .Status.ObservedGeneration is not updated if the release upgrade succeeds given: https://github.com/fluxcd/helm-controller/blob/5e760db4a83ffd384012c1b47e4615cc07d4081d/internal/controller/helmrelease_controller.go#L180-L182

Can you please add a unit test to prove this can actually happen?

Removing .Status.ObservedGeneration is not an option, this field is used by kstatus to determine readiness and almost all standard Kubernetes APIs have it. If .Status.ObservedGeneration is not kept up to date, then Flux kustomize-controller would be stuck waiting for HelmReleases which no one reported so far, so I'm really puzzled by your report.

seaneagan commented 7 months ago

is it possible that an error which is not a TerminalError can happen while the Ready condition still gets set to True? What is the intention of when .status.observedGeneration should get updated? Would it make sense to simply always update it since in the most basic sense the generation has been observed? Or perhaps it can look at the conditions ObservedGenerations to determine whether it has been observed "enough" to ensure consistency vs looking at the transient result of the individual reconciliation?

separately, would it be ok to consider my proposed change independently of the above, as it feels more correct to rely on the Ready condition ObservedGeneration as it is intimately tied to the Ready condition status which is also being checked for dependencies?

seaneagan commented 7 months ago

trying to translate the intention of that if condition into status conditions, perhaps if Ready==true or Stalled==true at that generation, then that generation has been observed?

seaneagan commented 7 months ago

ok, so it looks like i was hitting #925 which was still allowing the Ready condition to be set to true, even though Reconciling was also true. Should Ready take Reconciling into account?

https://github.com/fluxcd/helm-controller/blob/5e760db4a83ffd384012c1b47e4615cc07d4081d/internal/reconcile/release.go#L140

seaneagan commented 7 months ago

this seems pretty nice, wonder if it can be integrated? https://github.com/fluxcd/pkg/blob/60815565aaaa9bc92edc296a107288089764a384/runtime/reconcile/result.go#L89

stefanprodan commented 7 months ago

I think we need a test for when drift correction is in loop that should used the kstatus verifier that catches any miss configuration https://github.com/fluxcd/kustomize-controller/blob/e9f5628eccbfbc722a7637ecbf7f66580e2e4416/internal/controller/kustomization_wait_test.go#L135

The checks and logic we use for the Ready/Stalled/Reconciling is here https://github.com/fluxcd/pkg/blob/60815565aaaa9bc92edc296a107288089764a384/runtime/conditions/check/fail.go

stefanprodan commented 6 months ago

I think this will fix it https://github.com/fluxcd/helm-controller/pull/885

stefanprodan commented 6 months ago

@seaneagan with #885 merged I think this will solve your issue. Ready will reflect now that drift detection is running without reaching correction.