Open TheSpiritXIII opened 4 months ago
I am a little confused, given configuration reloading can happen at any time duration of the Prometheus runtime. We also have the initContainer ensuring an empty config is present before Prometheus starts to prevent transient crashlooping.
Can you elaborate on why this is necessary or why the other PR is failing without this?
@pintohutch thanks, I expanded on the PR description. Please let me know if it doesn't make sense!
This change causes the config-reloader to start running before Prometheus is running.
The problem is that the config-reloader never runs if the Prometheus instance starts crash-looping. This can create a small delay before Prometheus actually starts scraping metrics. Since the config-reloader won't update configs until Prometheus is ready, the config-reloader today has to wait for the Prometheus instance to recover first: the Prometheus instance will startup with the old configuration, and only after will be updated with the new configuration.
When we move flags to the configuration-side, this becomes critical. This is because the rule-evaluator will not startup without the appropriate configuration. Since it never starts up, the config-reloader will never write the configuration. They are essentially both waiting for each other, like a deadlock. Without this, this PR fails: https://github.com/GoogleCloudPlatform/prometheus-engine/pull/1059
As mentioned earlier, the config-reloader must hit a reload URL. As a "hack", I made it hit a readiness endpoint served by itself that functionally does nothing. When Prometheus comes up, it starts hitting the Prometheus' readiness endpoint.