Open pintohutch opened 7 months ago
This weekend gmp-operator started crashlooping with "no endpoints available for service gmp-operator". While no OOMKill events were visible, OOM turned out to be the cause. Putting in a VPA addressed it.
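For reference, the VPA in question could look roughly like the sketch below. The target/container names and the memory bounds are assumptions for illustration, not the exact manifest used:

```yaml
# Illustrative VerticalPodAutoscaler for the gmp-operator Deployment.
# Names (gmp-system, gmp-operator, container "operator") and bounds are
# assumed defaults -- verify against your cluster before applying.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gmp-operator
  namespace: gmp-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gmp-operator
  updatePolicy:
    updateMode: Auto   # let the VPA evict and resize pods automatically
  resourcePolicy:
    containerPolicies:
    - containerName: operator
      minAllowed:
        memory: 64Mi
      maxAllowed:
        memory: 1Gi
```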
Previous issue with rules-evaluator
I see that the rules-evaluator VPA hasn't moved either. I can live without VPAs, but having the webhook fail closed for gmp-operator was excruciating.
Thanks for reporting @clearclaw - I apologize for the frustration.
With the operator being such a mission-critical binary in managed-collection, we can also explore failing open for our webhooks. Would that have helped in this situation over a VPA specifically?
Yes, in that deploys would have worked despite the failure. Being unable to, e.g., ship a critical security fix due to GMP indigestion is not right.
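Failing open, as discussed above, would amount to setting `failurePolicy: Ignore` on the operator's admission webhooks. A rough sketch of the relevant fields (the configuration and webhook names here are illustrative and may not match what the operator actually deploys):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  # Illustrative name; check `kubectl get validatingwebhookconfigurations`
  # for the one the operator actually manages.
  name: gmp-operator.gmp-system.monitoring.googleapis.com
webhooks:
- name: validate.podmonitorings.gmp-operator.gmp-system.monitoring.googleapis.com
  # Ignore = fail open: if the gmp-operator endpoint is unavailable,
  # admission proceeds instead of rejecting the request.
  failurePolicy: Ignore
  sideEffects: None
  admissionReviewVersions: ["v1"]
  clientConfig:
    service:
      name: gmp-operator
      namespace: gmp-system
      path: /validate/monitoring.googleapis.com/v1/podmonitorings
```

One caveat: operators typically reconcile their own webhook configuration, so a manual patch like this may get reverted; failing open durably would need to be a change in the operator itself.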
Additionally, the lack of default observability/alerting around GMP remains a concern (gmp-operator or rules-evaluator previously here, but there's more components). If the observability stack fails, all the alarms should be going off. It would be really nice to see/have default Rules objects for all of GMP. (Why are there no podmonitoring objects for the GMP processes?)
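For illustration, the kind of self-monitoring asked for above could look like the following pair of resources. The label selector, metrics port, and alert expression are assumptions modeled on the repo's examples, not verified defaults:

```yaml
# Scrape the GMP collectors' own metrics endpoint
# (selector and port name are assumed).
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: collector-self
  namespace: gmp-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: collector
  endpoints:
  - port: prom-metrics
    interval: 30s
---
# A minimal self-alert: fire when any scraped GMP target is down.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: gmp-self-alerts
  namespace: gmp-system
spec:
  groups:
  - name: gmp-self
    interval: 30s
    rules:
    - alert: GMPComponentDown
      expr: up == 0
      for: 5m
```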
Supplemental, in case it's useful -- heading into and during the failures:
+1, some way of handling this would be important, let's prioritize it.
VPA/limits is one thing, but perhaps we should additionally prioritize a code-optimization pass to check for low-hanging fruit and start queueing things up (trading latency for memory).
> Additionally, the lack of default observability/alerting around GMP remains a concern (gmp-operator or rules-evaluator previously here, but there's more components). If the observability stack fails, all the alarms should be going off. It would be really nice to see/have default Rules objects for all of GMP. (Why are there no podmonitoring objects for the GMP processes?)
For managed collection we don't do this because we have our own alerting pipeline (though it's oriented toward fleet-wide and previously unknown situations). On top of that, we historically didn't deploy those self-observability resources by default because everyone on GKE would have to pay for the extra metrics. We do ship some example configuration anyone can deploy (on GKE Standard), though it's missing the operator.
To sum up, we have a bit to improve here -- thanks for reporting and for the ideas. Perhaps it's time to add that self-monitoring feature/option to OperatorConfig, but likely opt-in.
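If that did land in OperatorConfig, the opt-in could look something like this; the `selfMonitoring` field is purely hypothetical and does not exist today, this is only a sketch of the shape the suggestion could take:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
features:
  # Hypothetical opt-in knob -- not a real field as of this discussion.
  selfMonitoring:
    enabled: true
```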
Making self-observation opt-in is sensible. Something as simple as a doc page pointing at reference pod monitors and possible alert rules under https://g.co/kgs/dDdszEW would help a lot.
In short, make it easy for people to do good things. -- JCL
Can we find ways to avoid OOM crashes in the gmp-operator? Maybe using a VPA?
Acceptance criteria: