improvement: smoother startup

brigadecore / brigade

Event-driven scripting for Kubernetes

Apache License 2.0

Brigade components do a lot of "flapping" after helm install/upgrade.

The API server depends on both MongoDB and Artemis being up and running. The observer and scheduler, in turn, depend on the API server being up and running. Components whose network-bound dependencies aren't ready yet still try to start; they fail, and crash loop backoffs occur. If the backoffs progress to long enough intervals between retries, they can significantly increase the total time that the install/upgrade takes.

I want to propose that, rather than relying on the "free" eventual consistency we get courtesy of k8s, a better approach may be to use the retries package from brigadecore/brigade-foundations to take control of connection retries ourselves, without crashing. The result could be components that are slower to start while their dependencies aren't yet satisfied, but that eventually start without suffering crashes in the interim.

improvement: smoother startup #1821