GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus

[Design] Dynamic resource usage for GMP operator #793

Open pintohutch opened 7 months ago

pintohutch commented 7 months ago

Can we find ways to avoid OOM crashes in the gmp-operator? Maybe using a VPA?

Acceptance criteria:

clearclaw commented 3 weeks ago

This weekend gmp-operator started crashlooping with "no endpoints available for service gmp-operator". While there were no OOMKill events, an OOM turned out to be the cause. Putting in a VPA addressed it.
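For anyone hitting the same thing, a sketch of the kind of VPA that addressed it, assuming the operator runs as the gmp-operator Deployment in the gmp-system namespace (the resource bounds are illustrative, not recommendations):

```yaml
# Sketch only: target, namespace, and bounds are assumptions to adapt.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gmp-operator
  namespace: gmp-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gmp-operator
  updatePolicy:
    # Let the VPA evict and resize pods as usage grows.
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]
        minAllowed:
          memory: 64Mi
        maxAllowed:
          memory: 2Gi
```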

Previous issue with rules-evaluator

I see that the rules-evaluator VPA hasn't moved either. I can live without VPAs, but having the webhook fail closed for gmp-operator was excruciating.

pintohutch commented 3 weeks ago

Thanks for reporting @clearclaw - I apologize for the frustration.

With the operator being such a mission-critical binary in managed-collection, we can also explore failing open for our webhooks. Would that have helped in this situation over a VPA specifically?
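For context, "failing open" would mean setting failurePolicy: Ignore on the operator's admission webhooks, so the API server admits objects when the webhook endpoint is unreachable instead of rejecting them. A rough sketch of the relevant field (the webhook name, service, and rules below are illustrative, not the shipped manifests):

```yaml
# Illustrative only: names, paths, and rules are assumptions.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gmp-operator
webhooks:
  - name: validate.podmonitorings.gmp-operator.example
    # Ignore = fail open: admission proceeds if the webhook is down.
    # The default, Fail, rejects requests while the endpoint is unavailable.
    failurePolicy: Ignore
    clientConfig:
      service:
        name: gmp-operator
        namespace: gmp-system
        path: /validate/monitoring.googleapis.com/v1/podmonitorings
    rules:
      - apiGroups: ["monitoring.googleapis.com"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["podmonitorings"]
    sideEffects: None
    admissionReviewVersions: ["v1"]
```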

clearclaw commented 3 weeks ago

Yes, in that deploys would have worked despite the failure. Being unable to, e.g., ship a critical security fix due to GMP indigestion is not right.

Additionally, the lack of default observability/alerting around GMP remains a concern (gmp-operator here, rules-evaluator previously, but there are more components). If the observability stack fails, all the alarms should be going off. It would be really nice to see/have default Rules objects for all of GMP. (Why are there no PodMonitoring objects for the GMP processes?)
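Even a PodMonitoring for the operator itself, along the lines of the repo's self-pod-monitoring example, would be a start (the selector labels and metrics port name below are guesses; the operator's actual endpoint may differ):

```yaml
# Sketch: selector labels and port name are assumptions about the
# operator Deployment's actual metrics endpoint.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: gmp-operator
  namespace: gmp-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gmp-operator
  endpoints:
    - port: metrics
      interval: 30s
```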

clearclaw commented 3 weeks ago

Supplemental in case useful -- resource usage heading into and during the failures: [screenshot attached]

bwplotka commented 3 weeks ago

+1, some way of handling this is important; let's prioritize it.

VPA/limits is one thing, but perhaps we should additionally prioritize a code optimization pass: check for low-hanging fruit and start queuing work up (trading latency for memory).

> Additionally, the lack of default observability/alerting around GMP remains a concern (gmp-operator here, rules-evaluator previously, but there are more components). If the observability stack fails, all the alarms should be going off. It would be really nice to see/have default Rules objects for all of GMP. (Why are there no PodMonitoring objects for the GMP processes?)

For managed collection we don't do this, because we have our own alerting pipeline (though it's oriented toward fleet-wide and unknown situations). On top of that, we historically didn't deploy those self-observability resources by default, as everyone on GKE would have to pay for the extra metrics. We ship some example configuration (https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/examples/self-pod-monitoring.yaml) anyone can deploy (on GKE Standard), though it's missing the operator (https://github.com/GoogleCloudPlatform/prometheus-engine/issues/1123).

To sum up, we have a bit to improve here; thanks for the report & ideas. Perhaps it's time to add that self-monitoring feature/option to OperatorConfig, but likely opt-in.

clearclaw commented 3 weeks ago

Making self-observation opt-in is sensible. Something as simple as a doc page pointing at reference pod monitors and possible alert rules under https://g.co/kgs/dDdszEW would help a lot.
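As a strawman, even one reference Rules object like the following would catch an operator or rule-evaluator outage (the job label values are assumptions that depend on how the self-monitoring PodMonitorings end up naming their targets):

```yaml
# Strawman: expressions assume PodMonitorings named gmp-operator and
# rule-evaluator exist, so "job" carries those names on the up metric.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: gmp-self-monitoring
  namespace: gmp-system
spec:
  groups:
    - name: gmp-components
      interval: 30s
      rules:
        - alert: GMPOperatorDown
          expr: up{job="gmp-operator"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: The gmp-operator scrape target is down.
        - alert: GMPRuleEvaluatorDown
          expr: up{job="rule-evaluator"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: The rule-evaluator scrape target is down.
```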

In short, make it easy for people to do good things. -- JCL
