GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

Collector resource requests are very low #964

Closed: Harmelodic closed this issue 5 months ago

Harmelodic commented 6 months ago

We were looking into the resource usage of Pods across our cluster and saw that the collector DaemonSet has quite low resource requests, and the Pods are consistently using well-above those requests.

Looking into the DaemonSet config, we can see that the requested resources for the prometheus container in the collector Pods are:

resources:
  limits:
    memory: 3G
  requests:
    cpu: 4m
    memory: 32M

This results in the "CPU Request % Used" and the "Memory Request % Used" graphs in GKE showing very high usage percentages:

(Screenshot: GKE "CPU Request % Used" and "Memory Request % Used" graphs for the collector Pods, showing very high percentages.)

This is not optimal.

Please consider making resource requests configurable per cluster, or be more dynamic (using a VPA).

Incidentally, we noticed that there is a VPA example in this repo - however, this example does not cover CPU and doesn't appear to be applied (when looking through the manifests in the GCP Managed Prometheus documentation).

pintohutch commented 6 months ago

Hi @Harmelodic,

Thanks for reaching out.

This is not optimal.

Are you saying the graphs are not optimal? Or the YAML spec itself is not optimal? And can you elaborate on what you mean by "optimal"?

Please consider making resource requests configurable per cluster, or be more dynamic (using a VPA).

Sure - https://github.com/GoogleCloudPlatform/prometheus-engine/pull/943. This will be released in the coming weeks on GKE. Note: there are a few gotchas with using VPAs on a DaemonSet experiencing heterogeneous resource usage, which is why it's an opt-in feature.

Incidentally, we noticed that there is a VPA example in this repo - however, this example does not cover CPU and doesn't appear to be applied (when looking through the manifests in the GCP Managed Prometheus documentation).

See the above answer. If you're interested, you can definitely configure the VPA example to adjust according to CPU usage as well. cc @bernot-dev
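For illustration, a VerticalPodAutoscaler that controls both CPU and memory could look roughly like the following. This is a minimal sketch rather than a shipped manifest: it assumes the collector DaemonSet is named collector in the gmp-system namespace, that the prometheus container is the one being scaled, that the VPA CRDs are installed on the cluster, and the min/max bounds are purely illustrative.

# Sketch only - object names and bounds are assumptions, adjust for your cluster.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: collector
  namespace: gmp-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: collector
  updatePolicy:
    updateMode: "Auto"   # let the VPA apply its recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: prometheus
        controlledResources: ["cpu", "memory"]   # cover CPU as well as memory
        minAllowed:
          cpu: 4m
          memory: 32M
        maxAllowed:
          cpu: "1"
          memory: 3G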

Hope that helps.

Harmelodic commented 6 months ago

Heyo! Thanks for the info! 🤗

Are you saying the graphs are not optimal? Or the YAML spec itself is not optimal? And can you elaborate on what you mean by "optimal"?

Yeah, kinda assumed that the previous information explained that by inference. My bad ❤️

What I mean is that, because the resource requests on the DaemonSet are so low (4m & 32M) while the Pods' actual usage is comparatively high, it introduces cluster scaling issues and workload prioritisation issues.

By cluster scaling issues, I mean basically that if we want to scale our cluster effectively, it is important to set resource requests on all workloads, where the requested resource is pretty much what the container needs to fulfil its job (plus a little bit of a buffer for flexibility). Currently, GKE only scales up nodes when a Pod cannot be scheduled due to insufficient resources on a Node, based on requested resources. That means anything using more than its requested resources is more likely to be throttled before the cluster scales. With the collector requesting so little CPU & memory, it will be throttled massively before the cluster scales - resulting in the Pod being OOMKilled (when over the memory request) and/or generally very slow metric collection (when over the CPU request) - until the cluster eventually scales up the number of nodes, applications are rescheduled, and resources are freed up for the collectors to do their job.

As for the prioritisation issues, it's probably worth giving some context/examples:

Let's say we have a cluster that is packed full of applications that use a bunch of CPU and memory and produce a large amount of metrics, and on that cluster is Managed Prometheus (with this collector DaemonSet).

In this context, there are two scenarios I need to think about:

Scenario 1: I prioritise my applications over metric collection

On a particular node, I might prioritise my applications receiving as much of the node's available resources as possible, and so care less that the collector is CPU-throttled and limited in memory - even to the point where I don't mind if the collector is OOMKilled (and thus drops metrics). In this case, it is beneficial for the collector to have a very low CPU request (so the node throttles the collector Pod down to basically "idle" requirements) and a reasonable limit on memory - this is effectively the case now.

Scenario 2: I prioritise metric collection over my applications

On a particular node, I might prioritise metric collection above my applications receiving as much of the node's available resources as possible, and so care more that the collector is not CPU-throttled and has a high limit on memory. If my applications are throttled slightly, but still receive what they request (as is guaranteed with Kubernetes), then that's fine - as long as all the metrics are being collected and the collector container effectively never gets OOMKilled. In this case, the requested CPU should be in line with the actual CPU usage of the collector container, or higher, to ensure the collector is not constantly being throttled, and a higher memory request should be set (ideally, the same as the limit) to ensure the collector never gets OOMKilled - something like the sketch below.
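As a rough illustration of the kind of spec I mean for Scenario 2 (the numbers here are made up purely for illustration, not a recommendation):

resources:
  limits:
    memory: 3G
  requests:
    cpu: 250m   # illustrative: roughly in line with, or above, observed collector CPU usage
    memory: 3G  # illustrative: equal to the limit, so the full amount is reserved at scheduling time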

Out of these two scenarios (at least for the use cases I'm dealing with), there are very few cases where my applications getting most/all available node resources is a higher priority than ensuring all metrics get collected - especially since the attraction of Managed Prometheus as a product is "I don't have to worry about metric collection or storage". Therefore, the current resource configuration is not optimal.

Since you, as Google folks, cannot define what a Managed Prometheus customer's priorities should be, minimising the resource requests for the collector is understandable, since the customer might want to prioritise their applications over metric collection.

However, for Managed Prometheus customers that prioritise metric collection over application resource usage, this makes Managed Prometheus a bittersweet solution for metric collection & storage, since metrics aren't guaranteed to be collected, and those customers need to spend time thinking about, and working with, what was supposed to be a managed solution.


Possible improvements that come to mind:

Sure - https://github.com/GoogleCloudPlatform/prometheus-engine/pull/943. This will be released in the coming weeks on GKE. Note: there are a few gotchas with using VPAs on a DaemonSet experiencing heterogeneous resource usage, which is why it's an opt-in feature.

Again, thanks for linking to the PR! Once it gets released into Regular channel, I'm sure we'll take a look at the VPA.

Incidentally, I'm not sure what you mean by "heterogeneous resource usage"? Also, what are the gotchas?

bernot-dev commented 6 months ago

There are known limitations of Vertical Pod Autoscaling.

The "heterogeneous resource usage" comment refers to situations where load is not distributed equally across the pods where the workload is being autoscaled. In a Prometheus context, image that you have a cluster where only some of your workloads expose metrics, and those workloads are all scheduled on a single node. In that instance, the Prometheus pod on the node where all of the metrics are being scraped could be working hard (high CPU/memory usage), while the Prometheus pods on other nodes would be essentially idle. This would be a poor fit for Vertical Pod Autoscaling over the DaemonSet of collectors because the same recommendation would apply to all of the pods in the DaemonSet, even though they have unequal load. In extreme instances, this problem could also result in pods becoming unschedulable.

There are plans to address some of the known limitations, and it's possible we could see better performance across our use cases in the future through enhancements to Kubernetes. For now, we cannot guarantee that VPA will produce better results for all of our customers, so we are leaving the feature opt-in.

Harmelodic commented 6 months ago

There are known limitations of Vertical Pod Autoscaling.

The "heterogeneous resource usage" comment refers to situations where load is not distributed equally across the pods where ...

Thank you very much for this info! 🙏

Incidentally, ...

In that instance, the Prometheus pod on the node where all of the metrics are being scraped could be working hard (high CPU/memory usage), while the Prometheus pods on other nodes would be essentially idle.

Thankfully, we are not in that situation, since (a) all our workloads produce metrics and (b) we spread our workloads as evenly as possible across the cluster. So, we'll definitely take a look at the VPA as soon as it's available to our cluster 👍

However, whilst we don't have the VPA right now (and in case the VPA isn't a fit for a different reason, or simply to satisfy other customers' needs), it'd still be good to consider making some improvements to the existing resource configuration 😊 ❤️

bernot-dev commented 6 months ago

For GMP users, there is no "one size fits all" solution for resource configuration. GMP is on by default when you create a GKE cluster, but some customers do not use Prometheus metrics or do not use GMP. Among those who do use GMP, usage ranges from very low to extremely high. Any static resource request level we set will be suboptimal for some customers, either requesting too much or too little.

Setting resource requests too high is problematic because it could artificially displace user workloads, and the effect is multiplied over the number of nodes because the collectors run in a DaemonSet. As an extreme example, if we set the collector to request 1 CPU and 1 GB of memory, it would actually take up 3 CPUs and 3GB of memory on a small 3-node cluster, which would be unacceptable for all but heavy users of GMP, and may not meet their needs, either. This problem persists across different request levels, with different proportions of users being affected. We ultimately chose to set requests to what we would expect the GMP components to consume at idle, because it will never waste resources. In most cases, allowing "bursting" (using resources greater than the requested amounts) is preferable when the nodes are not at full capacity.

VPA can be a step in the right direction for some workloads, and we are continuing to explore additional options that will allow us to better meet the varied needs of our users.