GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

collector OOMKilled crashloop in GKE Autopilot #338

Closed: willchan closed this issue 1 year ago

willchan commented 2 years ago

We've been running GMP managed collectors in our GKE Autopilot clusters by deploying the manifests, per https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#config-mgd-collection, for half a year now. We've seen crashlooping before, but I only recently dove into it and noticed that the crashes, at least of late, are OOMKilled.

  - containerID: containerd://f011670b9de0ec82bed8130d76e818505802b95313118adbc2066835f9c20231
    image: gke.gcr.io/prometheus-engine/prometheus:v2.35.0-gmp.2-gke.0
    imageID: gke.gcr.io/prometheus-engine/prometheus@sha256:02449f32a9fa8ab02c529eb34aa5c13f49d5b6c56f5d3cf23ff7ca23af6eb7d2
    lastState:
      terminated:
        containerID: containerd://f011670b9de0ec82bed8130d76e818505802b95313118adbc2066835f9c20231
        exitCode: 137
        finishedAt: "2022-09-30T16:09:41Z"
        reason: OOMKilled
        startedAt: "2022-09-30T16:08:50Z"
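
For reference, a status block like the one above can be pulled straight from the crashing collector Pod with something along these lines, assuming the default gmp-system namespace and substituting the real Pod name:

% kubectl get pod <collector-pod-name> -n gmp-system -o yaml  # namespace assumed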

I looked at the different Pods in the DaemonSet, and their memory utilization varies. My understanding is that the collectors only scrape Pods on the same Node, so the disparate memory utilization is probably caused by the heterogeneity of the Pods on each Node, or rather, of the metrics those Pods expose, which the collector scrapes and exports to Monarch.

DaemonSet memory usage: (screenshot omitted)

Healthy Pod memory usage: (screenshot omitted)

OOMKilled Pod memory usage: (screenshot omitted)

Given this info, I think GKE Autopilot's default VerticalPodAutoscaler is learning a memory limit that is too low for some Pods. I'm not really sure whose bug this should be. One could say that the VPA is learning too low a limit, but this value seems to work for most of the Pods in the DaemonSet. So maybe the problem is that the Pods in the DaemonSet shouldn't have heterogeneous memory usage. That's what I'm leaning toward, so I think this is probably a GMP collector bug.

The net result is that some of my Pods go long periods (many hours, perhaps days) without having their metrics scraped, due to the Node-local collector Pod crashlooping. I also suspect it's our Pod that exports the most timeseries that is not getting scraped, since I frequently see alerts firing that rely on those timeseries.

I think I can work around this by applying a custom VerticalPodAutoscaler with a higher memory limit, but it doesn't seem like that should be necessary.

lyanco commented 2 years ago

Hey Will,

We're working hard on official Autopilot support - once that's out we can see if this still happens or if it's only happening because of the unofficial way you're deploying it.

cc @TheSpiritXIII

willchan commented 2 years ago

Yeah, I was wondering if that would get fixed with official support. But if my diagnosis is correct, the root problem is that the different collector Pods in the DaemonSet use differing amounts of memory, so the memory limit applied by GKE Autopilot (in the absence of a custom VPA) is too low. It's not obvious to me whether official Autopilot support would address this somehow. But yes, this is sort of an Autopilot-triggered problem, because Autopilot forces default VPA usage. Our non-Autopilot GKE clusters did not have this problem since we were sloppy and didn't configure anything, and I guess the non-Autopilot resource defaults allow more memory usage without getting OOMKilled. So you might see this when you roll out official Autopilot support: the memory limits are tighter by default, leading to OOMKilled crashlooping.

TheSpiritXIII commented 2 years ago

One recommendation I've seen the GKE Autopilot team make, which might be worth exploring in light of the issue you're seeing, is to set the request and limit memory amounts to the same value. I think that the recommendation doesn't make much sense because it means you're probably always over-allocating, but I wonder if they make that recommendation due to a limitation in how their systems adjust node memory amounts.

Looking at today's values, we are setting the memory request to 200M and the limit to 3G. If I'm reading the graphs correctly, it looks like GKE Autopilot does not respect the limits. I found documentation that says this would be the case: https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#resource-limits

There are a few other things on that page that we don't adhere to today.
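
For illustration, following that request-equals-limit recommendation would mean pinning the collector container's memory request and limit to the same value, roughly like the sketch below (the container name and the 2G figure are placeholders, not values from the shipped manifests):

containers:
- name: prometheus    # container name assumed; use the one from your collector DaemonSet
  resources:
    requests:
      memory: 2G      # placeholder watermark
    limits:
      memory: 2G      # equal to the request, per the Autopilot guidance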

willchan commented 2 years ago

Indeed, GKE Autopilot sets the limit to the request, which is why we get OOMKilled. That said, I believe the request & limit configuration is a bit of a red herring, because in GKE Autopilot, VerticalPodAutoscaler is enabled by default. So, over time, the initial resource requests in the DaemonSet get overwritten by the VerticalPodAutoscaler recommendations.

It's interesting to note that the VerticalPodAutoscaler appears incapable of noticing that a Pod is crashlooping with OOMKilled. Or perhaps VPA's algorithm chooses not to raise the limit above the maximum memory usage across all Pods, and instead assumes that all Pods in the DaemonSet should have similar memory usage, picking a memory limit based on some buffer over the median usage or something. But GMP's collector Pods appear to have heterogeneous memory usage, so they may not work well with the VerticalPodAutoscaler, leading to OOMKilled crashlooping.

TheSpiritXIII commented 2 years ago

One difference between managed collection and self-deployed collection is that GKE Autopilot should not change the limits on the managed-collection version.

One thing you can try is setting the cluster-autoscaler.kubernetes.io/safe-to-evict annotation to false here, which should prevent the Pod from being scaled down.
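
On the collector DaemonSet's Pod template, that would look roughly like this sketch (where exactly the annotation lives depends on your copy of the manifests):

spec:
  template:
    metadata:
      annotations:
        # prevents the cluster autoscaler from evicting the collector Pod
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"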

I will engage with the GKE Autoscaling team to see if they can do anything.

pintohutch commented 1 year ago

@willchan - when you mention the "default VerticalPodAutoscaler" in autopilot clusters, where is that documented?

When I create an Autopilot cluster, I don't see any provisioned:

% k get verticalpodautoscalers.autoscaling.k8s.io -A
No resources found

willchan commented 1 year ago

Hm, I may have misunderstood when the docs say that VerticalPodAutoscaler is enabled by default. On a re-read, I think it just means it's enabled as a cluster feature, but is optional. So the Autopilot/VPA part of my description may be a red herring; it's just that some of the collector Pods hit the limit and get OOMKilled, whereas the other collector Pods fit within the same limit.

What's the correct behavior here, then? Should I just raise the memory request to be comfortably above the max collector memory usage?

pintohutch commented 1 year ago

Yes, that's my understanding as well.

Deploying GMP collectors as a DaemonSet mitigates what we've been calling "shard-overload", but it doesn't fix the problem entirely. It's completely possible to overwhelm a particular Prometheus collector on a given node due to uneven scheduling of workloads.

This is a fundamental trade-off in the design of GMP, though I'd expect the same behavior, or worse, in any Prometheus deployment pattern on K8s.

The only workaround I can think of for this kind of issue would be to inject Prometheus as a sidecar container into pods in the cluster, using something like an annotation (similar to what Istio does). Though this may have undesired effects and is not in scope for GMP's design anytime soon.

As you suggest, our recommendation is either:

  1. Find the "high watermark" of resource usage and set that in the DaemonSet's CPU/RAM requests/limits fields (see the sketch after this list).
  2. Deploy a VPA to target the collector DaemonSet. We've actually just been exploring this option recently and it could be useful. If you go this route, let us know how it goes!
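
For the first option, something like the following can surface the high watermark, assuming the collectors run in the default gmp-system namespace (adjust to your deployment); you would then set the DaemonSet's memory request/limit comfortably above the largest value observed over time:

% kubectl top pods -n gmp-system --containers  # namespace assumed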

Given this information, are there any other action items to take before resolving this issue?

willchan commented 1 year ago

I have more thoughts here, but I agree we can close this issue.

I've been using a combination of (1) and (2), where I deploy a VPA with /spec/resourcePolicy/0/minAllowed/memory set to the high watermark. It works; it just feels inefficient, since the same resource request gets applied to all Pods in the DaemonSet. I have to use minAllowed because, in practice, the VPA seems to learn a memory limit that is below the high watermark. Arguably, that's a bug in the VPA, though. It seems to assume that all Pods should have equivalent CPU/RAM requirements, whereas the collector's memory utilization depends on the other workloads on the same node.
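
For anyone landing here later, the shape of that workaround is roughly the sketch below (the gmp-system namespace, the collector and prometheus names, and the 2G watermark are assumptions; substitute your own values):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: collector-vpa
  namespace: gmp-system        # namespace assumed
spec:
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: collector            # DaemonSet name assumed
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: prometheus   # container name assumed
      minAllowed:
        memory: 2G                # your measured high watermark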

That said, from VPA's perspective, perhaps this is a bug in the collector. Why does the memory usage vary by Pod/Node? Actually, why is the memory usage high at all? Is buffering required in the architecture somewhere? If the scraping is streamed to GCM, you'd think the memory utilization is low. Is buffering required for an important reason (like handling GCM API retries)?

pintohutch commented 1 year ago

> That said, from VPA's perspective, perhaps this is a bug in the collector. Why does the memory usage vary by Pod/Node?

Because a Prometheus scraping 10k time series uses more memory than a Prometheus scraping 1k timeseries.

This comes down to the types of workloads you're running and how they get scheduled on your cluster. If the scheduler unevenly allocates workloads that are emitting metrics among nodes, then collector resource usage will, in kind, be uneven.

The collectors are deployed as a DaemonSet. AFAIK there's no way to scale resource requests/limits on individual pods within a DaemonSet independently based on usage.

> Actually, why is the memory usage high at all? Is buffering required in the architecture somewhere? If the scraping is streamed to GCM, you'd think the memory utilization is low. Is buffering required for an important reason (like handling GCM API retries)?

There are a few reasons for the extra memory use (this is highlighted in the docs):

  1. Converting Prometheus samples and serializing them to their GCM protobuf equivalents before sending over the wire to GCM is memory intensive.
  2. GMP Prometheus collectors have to cache the metadata and offsets for scraped timeseries to inform the GCM protobuf APIs (e.g. start time) that are written to GCM/Monarch.
  3. Samples are cached to batch together in write calls as per best practices, though they are flushed periodically to avoid head-of-line blocking.
  4. Note - Prometheus Remote Write also uses more memory than a conventional, stateful instance.

Does that make sense?

willchan commented 1 year ago

> That said, from VPA's perspective, perhaps this is a bug in the collector. Why does the memory usage vary by Pod/Node?

> Because a Prometheus scraping 10k time series uses more memory than a Prometheus scraping 1k timeseries.

> This comes down to the types of workloads you're running and how they get scheduled on your cluster. If the scheduler unevenly allocates workloads that are emitting metrics among nodes, then collector resource usage will, in kind, be uneven.

Yeah, totally agree with you here on the heterogeneity of Prometheus as-is; I was more surmising as to VPA's perspective. I just looked it up, and indeed the docs suggest that Vertical Pod autoscaling works best with long-running homogeneous workloads. It seems like when Pod resource utilization is heterogeneous, VPA won't work well. So maybe we shouldn't try VPA here.

> The collectors are deployed as a DaemonSet. AFAIK there's no way to scale resource requests/limits on individual pods within a DaemonSet independently based on usage.

> Actually, why is the memory usage high at all? Is buffering required in the architecture somewhere? If the scraping is streamed to GCM, you'd think the memory utilization is low. Is buffering required for an important reason (like handling GCM API retries)?

> There are a few reasons for the extra memory use (this is highlighted in the docs):

>   1. Converting Prometheus samples and serializing them to their GCM protobuf equivalents before sending over the wire to GCM is memory intensive.
>   2. GMP Prometheus collectors have to cache the metadata and offsets for scraped timeseries to inform the GCM protobuf APIs (e.g. start time) that are written to GCM/Monarch.
>   3. Samples are cached to batch together in write calls as per best practices, though they are flushed periodically to avoid head-of-line blocking.
>   4. Note - Prometheus Remote Write also uses more memory than a conventional, stateful instance.

> Does that make sense?

Thanks for the detailed answer! I clearly made the typical mistake of assuming something could be simply addressed; lots of thought has already been invested here.