GoogleCloudPlatform / prometheus

The Prometheus monitoring system and time series database. GCP fork to export to Google Cloud Managed Service for Prometheus. Main branch is kept at parity with upstream - see branches or tags for Google's additions.
https://g.co/cloud/managedprometheus
Apache License 2.0

Prometheus 2.47 #128

Open dh185221 opened 10 months ago

dh185221 commented 10 months ago

Proposal

Hi, do you know if there are plans to bring GMP in line with Prometheus 2.47 or higher?

The memory reduction noted in https://www.youtube.com/watch?v=29yKJ1312AM (which I believe says it will be available in 2.47) is important for our use case of running Prometheus in environments where we want to keep memory usage to a minimum.

TheSpiritXIII commented 10 months ago

Hi, thanks for the interest!

Currently we are trying to get 2.43 out, which is the last release before the stringlabels optimization was turned on by default. Once that is out, you can build with this optimization enabled for some memory savings.
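For readers who want to try this before it's on by default: upstream Prometheus 2.43 exposes the optimization behind the `stringlabels` Go build tag, so a build from this fork might look roughly like the following sketch (the clone URL is this repo; the `GOFLAGS` approach follows the upstream 2.43 release notes):

```shell
# Sketch: build Prometheus with the stringlabels optimization enabled.
git clone https://github.com/GoogleCloudPlatform/prometheus.git
cd prometheus
# Pass the build tag through GOFLAGS so the Makefile's go invocations pick it up.
GOFLAGS="-tags=stringlabels" make build
```

The resulting binary stores labels in a single string rather than a slice of pairs, which is where the memory savings come from.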

After we get 2.43 out, we plan to release 2.45 which is the current upstream LTS.

With that said, we do not have plans to rebase on top of 2.47 at the moment. However, we do listen to the community so if this is something the community wants, we can prioritize it accordingly! For others who are interested in this, please add a reaction to the original comment.

Feel free to add any information you could give us about your workflow which results in increased memory usage and we can also consider this.

ak185158 commented 9 months ago

@TheSpiritXIII - We (@dh185221 and I) self-deploy managed Prometheus via prometheus-operator on our project in every cluster, collecting a number of custom metrics in addition to the standard array of kube-state-metrics and node-exporter metrics. While some of our infrastructure resides in GCP, an array of edge clusters communicates with that infrastructure, which necessitated self-managing our deployments and configuration.

On the edge clusters especially, we are trying to reduce our footprint as much as possible, since resources like memory can be constrained on customer hardware. We do have a handful of metrics used for diagnosing and monitoring connection traffic, both internal and to/from external connections. Unfortunately, the cardinality of such metrics can have a significant impact on Prometheus's resource usage. We continue to work to identify and reduce metric cardinality where possible, and to look for other ways to improve the efficiency of Prometheus in these scenarios.
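As an aside for anyone debugging similar cardinality problems, a common first step is to ask a running Prometheus which metric names hold the most series (standard PromQL, no assumptions beyond a reachable server):

```promql
# Top 10 metric names by series count currently in the TSDB head.
topk(10, count by (__name__) ({__name__=~".+"}))
```

The names at the top of this list are usually the best candidates for relabeling or dropping.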

While the upper end of larger clusters is a critical point we monitor and work to address, the average memory usage of Prometheus is still larger than we'd prefer, even in smaller clusters, and the improvements promised in newer Prometheus builds could help us reach our goal of reducing our monitoring resource footprint.

Originally we were using Prometheus builds with a stackdriver sidecar to send the data to GCP, but switched to the managed-prometheus version for a number of improvements around pushing those metrics to GCP Monitoring, as well as reduced billing costs for doing so. This has worked well; our only issue has been the slow uptake of Prometheus core source changes, particularly when new builds have been out for a while and contain significant improvements or address issues we've encountered in the past. This also affects how quickly we can update other dependencies like prometheus-operator, where newer versions of the CRDs no longer support the older Prometheus version we're limited to.

We can understand Google's concern for stability and the limited resources available to track upstream's release schedule, and of course we wouldn't expect every release of Prometheus to be pulled into managed Prometheus, but Prometheus core 2.41.0 was released over a year ago. Surely we can find a middle ground, updating the Prometheus core a little more often than that, particularly when new features, bug fixes, or efficiency improvements make the effort worthwhile. The version Dale mentions above seems like a significant improvement on the efficiency front that would be worth considering for a subsequent release and a place on the roadmap. That would at least help us evaluate our own roadmaps for when we might be able to take advantage of such changes.

Thanks!

ak185158 commented 6 months ago

@TheSpiritXIII Any update on consideration/roadmap for moving to a 2.47+ version of the Prometheus source? The touted close-to-50% reduction in memory overhead is very appealing and critical to our continued use of Managed Prometheus.

https://thenewstack.io/30-pull-requests-later-prometheus-memory-use-is-cut-in-half/

We're currently evaluating other avenues due to the massive overhead we see with our metric volume in larger clusters, as we must significantly reduce our resource usage footprint. If there hasn't been a reconsideration of moving to the 2.47 release or later, we're going to have to consider other options for our metric collection needs (e.g., the OpenTelemetry Collector).

TheSpiritXIII commented 6 months ago

Hey @ak185158 the next Prometheus LTS version should be released around June/July. We will definitely update our fork to accommodate the next LTS version when it's available (Prometheus 2.53?).

The current latest version in GCR is v2.45.3-gmp.1-gke.0. Please take a look at that if you haven't already!

Some other solutions:

  • You can use metric relabeling to drop labels, and see if that reduces your footprint at all.
  • You mentioned you are using prometheus-operator. They are planning on adding replicas or DaemonSet support, which would help here.
  • Maybe other ways you could split the load? e.g. could you potentially have 2 prometheus-operators running or maybe a dedicated Prometheus deployment for any heavy applications?

I'm unsure if we'll be able to get something out earlier since we're a small team. On the Prometheus side, we're currently focusing our efforts around native histograms and remote write v2. The latter might have some additional benefits there, as you'll hypothetically be able to enable agent mode. We're also planning on supporting prometheus-operator CRDs in Google Managed Service for Prometheus (via conversions) sometime this year. Our managed service already uses DaemonSets so each Prometheus instance tends to be small enough to handle node load.

I know it's not an answer you want to hear. You can also volunteer to add support to our repositories and we'd be happy to code review and get your changes merged!
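For reference, dropping high-cardinality labels via Prometheus metric relabeling (one of the mitigations discussed in this thread) can be expressed as a scrape-config fragment like this; the job name, target, and `remote_ip` label are hypothetical examples, not taken from this thread:

```yaml
scrape_configs:
  - job_name: edge-traffic            # hypothetical job name
    static_configs:
      - targets: ["localhost:9090"]   # placeholder target
    metric_relabel_configs:
      # Drop a high-cardinality label from scraped samples before ingestion.
      - action: labeldrop
        regex: remote_ip
```

Because `metric_relabel_configs` runs after the scrape but before storage, dropped labels never create series in the TSDB.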

vm250200 commented 4 months ago

> Hey @ak185158 the next Prometheus LTS version should be released around June/July. We will definitely update our fork to accommodate the next LTS version when it's available (Prometheus 2.53?).
>
> The current latest version in GCR is v2.45.3-gmp.1-gke.0. Please take a look at that if you haven't already!
>
> Some other solutions:
>
> • You can use metric relabeling to drop labels, and see if that reduces your footprint at all.
> • You mentioned you are using prometheus-operator. They are planning on adding replicas or DaemonSet support, which would help here.
> • Maybe other ways you could split the load? e.g. could you potentially have 2 prometheus-operators running or maybe a dedicated Prometheus deployment for any heavy applications?
>
> I'm unsure if we'll be able to get something out earlier since we're a small team. On the Prometheus side, we're currently focusing our efforts around native histograms and remote write v2. The latter might have some additional benefits there, as you'll hypothetically be able to enable agent mode. We're also planning on supporting prometheus-operator CRDs in Google Managed Service for Prometheus (via conversions) sometime this year. Our managed service already uses DaemonSets so each Prometheus instance tends to be small enough to handle node load.
>
> I know it's not an answer you want to hear. You can also volunteer to add support to our repositories and we'd be happy to code review and get your changes merged!

Hey @TheSpiritXIII, as we are in June right now, do we have any update on the next Prometheus LTS version release date?

TheSpiritXIII commented 4 months ago

@vm250200 hey again! 👋

The Prometheus team announced today that they are releasing 2.53 as their next LTS version and ending support for 2.45 in July. As such, we'll aim to rebase this fork for July as well.

We'll link the PR to this issue so you'll get updated as we make it available. Thanks for being patient!

vm250200 commented 2 months ago

@TheSpiritXIII Hey 👋🏼

When are you planning to review and merge https://github.com/GoogleCloudPlatform/prometheus/pull/193 and deploy the changes?

TheSpiritXIII commented 2 months ago

Hello @vm250200, #193 is not currently a priority for the team.

In the meantime, the PR is available; it is only minimally tested, but it passes integration tests. If you have Artifact Registry enabled on your GKE project, you can clone the branch and push a Docker image to your project, which you can then use with prometheus-operator:

docker build -t location-docker.pkg.dev/project/repository/prometheus:v2.53.1 .
docker push location-docker.pkg.dev/project/repository/prometheus:v2.53.1
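Once pushed, the custom image can be referenced from a prometheus-operator `Prometheus` resource via its `spec.image` field; this is a minimal sketch reusing the placeholder registry path above, with a hypothetical resource name:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: gmp-custom          # hypothetical name
spec:
  # Point the operator at the custom-built image pushed above.
  image: location-docker.pkg.dev/project/repository/prometheus:v2.53.1
  replicas: 1
```

Setting `spec.image` overrides the operator's default Prometheus image for that instance.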