GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

gmp-operator OOMKilled when turning on targetStatus #774

Closed: andreasmh closed this issue 7 months ago

andreasmh commented 8 months ago

We tried to enable targetStatus on the OperatorConfig, according to these docs: https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#target-status

apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
features:
  targetStatus:
    enabled: true

When that was enabled, the gmp-operator started crashlooping because it was being OOMKilled. We saw no logs or anything that could indicate what the problem was from the operator pod.

When we tried to disable the targetStatus feature:

apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
features:
  targetStatus:
    enabled: false

We got the following error:

operatorconfigs.monitoring.googleapis.com "config" could not be patched: Internal error occurred: failed calling webhook "validate.operatorconfigs.gmp-operator.gmp-system.monitoring.googleapis.com" : failed to call webhook: Post "https://gmp-operator.gmp-system.svc:443/validate/monitoring.googleapis.com/v1/operatorconfigs?timeout=10s": dial tcp 10.13.51.53:10250: connect: connection refused

So basically, because the gmp-operator was OOMKilled very soon after startup, we had real trouble changing the configuration, since the validating webhook is hosted by the gmp-operator itself.

Things we tried to go back to a working state:

Thing that ultimately worked:

How I would have liked things to work:

- gmp-operator not being OOMKilled
- No circular dependency between the config of the gmp-operator and the gmp-operator (with validating webhook hosted by the gmp-operator that is configured by the OperatorConfig)
- Being able to somehow change the resource requests and limits for the gmp-operator deployment

I have no idea why the gmp-operator is being OOMKilled: there is no shortage of memory on the worker nodes, and the gmp-operator outputs no logs that would indicate a problem.

We did this in two separate projects and GKE clusters, with the same result. Some stats that might be interesting and/or help with debugging why enabling targetStatus makes the gmp-operator get OOMKilled:

One cluster has ~1800 pods, ~270 PodMonitoring resources, and 1 ClusterPodMonitoring resource.
The other cluster has ~1000 pods, ~310 PodMonitoring resources, and 1 ClusterPodMonitoring resource.

The operator is running version operator:v0.8.0-gke.4 on GKE version 1.27.7-gke.1121000.

maxamins commented 7 months ago

Hi Andreas, thanks for reaching out. I'm glad you were able to fix your issue, and thanks for the debug information you provided.

Unfortunately, it's a known issue that targetStatus struggles on large clusters. For now I recommend not enabling targetStatus on large clusters. We do plan on supporting large clusters in the future.

andreasmh commented 7 months ago

Ok, that is unfortunate, but thanks for the reply! Is there another way of checking the number of targets for a PodMonitoring or ClusterPodMonitoring resource?

pintohutch commented 7 months ago

Hey @andreasmh - thanks for inquiring!

Apologies you're hitting this. We do have documentation around troubleshooting target status using kubectl port-forward.

It's not as streamlined as the target status reporting feature, but if you know the node where the target in question is running, you can investigate that node's collector pod to get this info.
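
For instance, a rough sketch of what that flow looks like (the pod name below is a placeholder, and this assumes the default gmp-system namespace and the collector's default Prometheus listen port of 19090):

# Find the collector pod scheduled on the same node as your target
kubectl get pods -n gmp-system -o wide | grep collector

# Forward that collector's Prometheus UI locally
kubectl -n gmp-system port-forward pod/collector-xxxxx 19090

# Then open http://localhost:19090/targets to see scrape targets and errors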

Hope that helps

pintohutch commented 7 months ago

Addressing your other feedback:

How I would have liked things to work:

gmp-operator not being OOMKilled

We would like to benchmark and further optimize this feature to help mitigate these kinds of issues. But our capacity is stretched thin at the moment. We always welcome external contributors to chip in 😄 !

The main problem is that the operator polls every collector pod in the cluster, aggregates the target information in memory, and then funnels it back out to each PodMonitoring. On large clusters with lots of collectors (and lots of PodMonitorings) this doesn't scale well. We were ok with this trade-off to start, as we envisioned it being an onboarding/debugging tool, which would presumably run on smaller clusters. However, I empathize with wanting to leverage it at scale as well.

There are certainly code optimizations we could make to help things. Though maybe an easy, brute-force approach would be to limit the number of PodMonitorings we're updating. Or perhaps make it configurable... dunno, need to think more there. Created https://github.com/GoogleCloudPlatform/prometheus-engine/issues/792.

No circular dependency between the config of the gmp-operator and the gmp-operator (with validating webhook hosted by the gmp-operator that is configured by the OperatorConfig)

This is a great point. Having to do backflips to get things back running like you did isn't great. I wonder if it's worth failing open in some cases... created https://github.com/GoogleCloudPlatform/prometheus-engine/issues/791
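
For anyone stuck in that loop in the meantime, one possible (unofficial) escape hatch is to temporarily remove the operator's ValidatingWebhookConfiguration so the OperatorConfig patch isn't blocked. On GKE the managed components may recreate it, and the exact resource name should be confirmed from the list output rather than taken from the example below:

# List webhook configurations and find the one served by gmp-operator
kubectl get validatingwebhookconfigurations

# Temporarily delete it so the OperatorConfig can be patched (name below is illustrative)
kubectl delete validatingwebhookconfiguration gmp-operator.gmp-system.monitoring.googleapis.com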

Being able to somehow change the resource requests and limits for the gmp-operator deployment

On the GKE-managed component, this isn't possible directly. However, you could try a VPA for the operator. In fact, we may want to look into shipping it with the component ourselves (cc @bernot-dev). But gotta think about trade-offs there first... created https://github.com/GoogleCloudPlatform/prometheus-engine/issues/793.
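
For reference, a minimal sketch of what such a VPA could look like, assuming the VPA components are enabled in the cluster and targeting the gmp-operator Deployment in the gmp-system namespace (an illustration, not a supported or tested configuration):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  namespace: gmp-system
  name: gmp-operator
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gmp-operator
  updatePolicy:
    # Let the VPA evict and recreate the pod with updated resource requests
    updateMode: Auto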

Overall - thanks for this feedback and for using managed collection!

andreasmh commented 7 months ago

Hey @pintohutch!

Thank you for the answers and the link to the troubleshooting documentation; I completely missed that one!

Feel free to close this if you want, considering that you have separate issues for the three things!

pintohutch commented 7 months ago

SGTM - will close