brigadecore / brigade

Event-driven scripting for Kubernetes
https://brigade.sh/
Apache License 2.0
2.4k stars 247 forks source link

improvement: smoother startup #1821

Closed krancour closed 2 years ago

krancour commented 2 years ago

Brigade components do a lot of "flapping" after helm install/upgrade.

The API server depends on both MongoDB and Artemis being up and running. The observer and scheduler depend on the API server being up and running. Components whose network-bound dependencies aren't ready yet still try to start, they fail, and crash loop backoffs occur. If the backoffs progress to lengthy enough intervals between retries, they can really slow down the total time that the install/upgrade takes.

I want to propose that a better approach to all this "free" eventual consistency (courtesy of k8s) may be to use the retries package from brigadecore/brigade-foundations to take control of connection retries ourselves without crashing. The result could be components that are slower to start when their dependencies aren't satisfied yet, but will eventually start without having to suffer crashes in the interim.

krancour commented 2 years ago

More info:

The error message typically seen when the API server crashes on startup is:

error adding indexes to events collection: context deadline exceeded

This is because, per documentation, the mongo.Connect(...) call made at startup doesn't actually perform any I/O.

Attempts to ensure the indices on various collections are the first attempts at any database I/O.

If we test the connection immediately after obtaining it and put that login inside a retry loop, we'll be in a much better position.