akuity / kargo

Application lifecycle orchestration
https://kargo.akuity.io/
Apache License 2.0

Distinguish permanent API errors from transient ones #1640

Open hiddeco opened 6 months ago

hiddeco commented 6 months ago

At present, we do not distinguish permanent errors (e.g. "not found") from transient ones (e.g. "the Kubernetes API server temporarily cannot be reached"). Because of this, a Stage's verification process may fail prematurely even though the controller could, in theory, recover automatically if given the time.

Manually recovering from this is cumbersome for the user and potentially wastes the computing power already used by the AnalysisRun. I think we can do a better job of distinguishing these types of errors, and avoid giving up on transient ones by e.g. requeueing and not erasing AnalysisRun references.

xref: https://github.com/akuity/kargo/pull/1611#discussion_r1525229572


Note: While I have only observed this to happen for a Stage's verification process, this may actually apply to more areas of Kargo.

krancour commented 3 weeks ago

I think we've made progress on this and there's more to be made still, but I think that, like #1479, this is an ongoing effort that we can kick from release to release until we feel satisfied.