Closed: andreasmh closed this issue 2 months ago
It's pretty unlikely we will support counter exemplars in the near term. In Monarch, exemplars are stored with the point itself, and only histograms have the necessary data structure (proto) for that. Counters and Gauges are stored far more efficiently and there's no place to put an exemplar. Changing this would likely require us building an entirely separate exemplar storage system, which isn't likely given that most of the need for exemplars is on histograms.
That being said, we should probably give you a way to ingest these metrics without exemplars (dropping the exemplar + ingesting the time series, instead of erroring) so that you don't have to modify the exporter to get these metrics. I'll open a FR for this, thank you for letting us know!
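The fallback behavior described here (ingest the time series, drop the exemplar) could be sketched roughly as follows. This is a minimal illustration in Python, not GMP's actual ingestion code; the data shapes and function name are assumed for the example:

```python
# Minimal sketch of the proposed fallback: instead of rejecting a scrape that
# carries exemplars on unsupported series, strip the exemplar and keep the
# sample. Data shapes here are invented for illustration; this is not GMP code.

def strip_unsupported_exemplars(samples):
    """Drop exemplars from any series that is not a histogram bucket."""
    cleaned = []
    for sample in samples:
        if sample.get("exemplar") and not sample["name"].endswith("_bucket"):
            # Keep the point itself; only the exemplar is discarded.
            sample = {**sample, "exemplar": None}
        cleaned.append(sample)
    return cleaned

samples = [
    # A counter with an exemplar: previously this failed the whole scrape.
    {"name": "http_requests_total", "value": 42.0,
     "exemplar": {"trace_id": "abc123", "value": 1.0}},
    # A histogram bucket exemplar: supported, passes through untouched.
    {"name": "request_latency_seconds_bucket", "value": 10.0,
     "exemplar": {"trace_id": "def456", "value": 0.2}},
]
cleaned = strip_unsupported_exemplars(samples)
```

With this behavior the counter time series is still ingested and only its exemplar is lost, which is the trade-off discussed in this thread.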
> Is there any timeline for when managed Prometheus will update to Prometheus v2.43 or later?
To answer this part of the question, we are currently deploying Prometheus v2.43 out to the GKE Rapid release channel for new clusters within the next few days.
We believe the upgrade should suppress the error but the exemplars will be silently dropped, although we still need to test and verify this.
> It's pretty unlikely we will support counter exemplars in the near term. In Monarch, exemplars are stored with the point itself, and only histograms have the necessary data structure (proto) for that. Counters and Gauges are stored far more efficiently and there's no place to put an exemplar. Changing this would likely require us building an entirely separate exemplar storage system, which isn't likely given that most of the need for exemplars is on histograms.
>
> That being said, we should probably give you a way to ingest these metrics without exemplars (dropping the exemplar + ingesting the time series, instead of erroring) so that you don't have to modify the exporter to get these metrics. I'll open a FR for this, thank you for letting us know!
I think we can live with exemplars being dropped in that case. It is much better from our point of view to drop the exemplars rather than failing to scrape and not getting any metrics at all.
> To answer this part of the question, we are currently deploying Prometheus v2.43 out to the GKE Rapid release channel for new clusters within the next few days.
>
> We believe the upgrade should suppress the error but the exemplars will be silently dropped, although we still need to test and verify this.
Sounds great, I'm looking forward to it!
Thank you for the quick responses! ⭐
@TheSpiritXIII Can you clarify please how to get the updated Managed Prometheus v2.43 for an existing GKE cluster? Is it enough to just upgrade the GKE cluster to the latest version from the GKE Rapid release channel using the usual GKE cluster upgrade menu in the Cloud Console? Will the Managed Prometheus instance be upgraded to v2.43+ automatically?
And how can I be notified when Managed Prometheus v2.43 becomes available in the GKE Rapid release channel? Just to force the upgrade and check if it works.
Hi @xak2000! If you just want to test it, creating a brand new cluster on the latest GKE minor version using the Rapid release channel will always have the latest GMP version. Existing clusters on the latest GKE minor version are being upgraded randomly in these next few weeks (outside of our control). Our latest version requires `1.29.1-gke.1545000` or above. Hope that helps!
The cluster that I wanted to use to test it is on the `Regular` channel right now. My plan was to switch it temporarily to the `Rapid` channel, test it, then switch it back. It is not a production cluster, so I can "play" with it.
But if the upgrade to the latest GMP version is not guaranteed with the switch of the release channel, then it's probably not worth it. Creating a new cluster, configuring the build environment, deploying the apps (with metrics exporters) that I want to test, etc. could be too time-consuming.
I'll wait until `1.29.1-gke.1545000`+ is promoted to the `Regular` channel, then.
Thank you @TheSpiritXIII, this helped!
Appending this here, in case anyone else sees this 🧵. Set the following `params` entry in `PodMonitoring`:

```yaml
params:
  format:
  - text/plain
```

With this, `text/plain` is negotiated as the `Content-Type` and exemplars are disabled completely. For our client, which relies on `http_server_requests_seconds_count` for HPA, this workaround will suffice for now. However, one might really wish that a product which is labeled and sold as Prometheus would actually maintain feature compatibility with Prometheus.
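For context, the `params` workaround above sits inside a full `PodMonitoring` resource. A sketch of what that might look like follows; the metadata, selector, and port are placeholders, and the placement of `params` under the scrape endpoint is my assumption about the GMP API, so check against your own resource:

```yaml
# Illustrative PodMonitoring resource; names and selector are placeholders.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: metrics
    interval: 30s
    params:
      format:
      - text/plain   # forces the plain-text negotiation described above
```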
I'm a bit puzzled here, since exemplars for `Counter` have been supported by Prometheus since the introduction of exemplars. On the other hand, exemplars on `_count` for `Summary` and `Histogram` were not: https://github.com/micrometer-metrics/micrometer/pull/3996
Phrasing it differently: exemplars were supported only on `Counter`s and `Histogram` buckets.
I guess GKE managed Prometheus supports exemplars on `Counter` but does not support exemplars on `_count` for `Summary` and `Histogram`. This conflicts with what @lyanco said above, can somebody confirm?
Does anyone know why an unsupported version of Prometheus is used in GKE? 2.50.1 is the current Prometheus Server version. Does GKE offer its own Prometheus build with a separate support lifecycle?
Regarding why we don't support Counter exemplars - this might not be a satisfying answer, but we intentionally don't aim to maintain 100% compatibility with OSS Prometheus. Our goal is to support the vast majority of use cases, while strategically compromising on some lesser-used use cases so that we can continue to use the power and scale of Monarch to back the system. Supporting every single Prometheus function would require building a backend from complete scratch, and doing that would preclude us from supporting other highly-requested non-Prometheus flows such as querying Cloud Monitoring metrics with PromQL and exporting data from managed Prom to BigQuery. We do aim to become more conformant over time, but unfortunately Counter exemplars is one of those features that would be prohibitively expensive for us to add due to how deep the incompatibility is baked into Monarch.
Monarch supports exemplars on the individual histogram buckets, in this case it'd be the _bucket time series, which are only found on histogram metrics.
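Concretely, a bucket-level exemplar in the OpenMetrics text exposition is attached to an individual `_bucket` sample; everything after the trailing `#` on the sample line is the exemplar (its labels and value). Metric name and numbers below are illustrative:

```text
# TYPE request_latency_seconds histogram
request_latency_seconds_bucket{le="0.5"} 129 # {trace_id="abc123"} 0.403
request_latency_seconds_bucket{le="+Inf"} 144
request_latency_seconds_count 144
request_latency_seconds_sum 53.2
```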
Due to current architecture constraints, we need to maintain and redistribute a small fork of Prometheus to support managed collection. Given our scale and our customers' reliability needs, this requires us to do a lot of qualification of each new OSS release, meaning we don't support every new version of Prometheus. Our current policy is to support every LTS release plus new versions that substantially change functionality plus whatever kube-prometheus pegs to. Once OSS prometheus supports remote write v2 or OTLP, we likely will be able to sunset our fork, at which point we can peg to head much more easily and quickly.
Hope this helps clarify.
> Our goal is to support the vast majority of use cases, while strategically compromising on some lesser-used use cases so that we can continue to use the power and scale of Monarch to back the system.
Thank you very much for the explanation! I would like to call out that this limitation (the exclusive support of histogram buckets) makes it hard for users to troubleshoot anything other than latency-related issues. E.g.: an error rate went up, give me an example trace for that time series to see what happened.
I might have diverted the conversation with this a little, sorry for my confusion. Let me go back to the original issue description. I think the title and the description of the issue might be misleading. Even if GCP's Prometheus fork does not support exemplars on `Counter`s, I assume it does not drop the whole data set if it finds one but simply ignores it instead.
I think the problem is not with `Counter`s but with the `_count` time series for `Summary` and `Histogram`, since Prometheus versions before 2.43 were dropping the whole dataset if they found an exemplar on `_count` (basically on anything other than `Counter` and `Histogram` buckets).
So I think the solution for the original issue is not supporting exemplars for `Counter`s, but not dropping the data set if there is a time series that has exemplars outside histogram buckets (Prometheus 2.43+ behavior).
Does this make sense?
+1 - agreed on your assessment, upon further review it looks like once 2.45 is rolled out this issue will fix itself automatically. We're still qualifying it... should be soon though.
Also, to add to my previous statement about pegging to new versions: we are FedRAMP compliant with respect to patching CVEs and vulnerabilities. This all happens behind the scenes.
Is there any workaround, until the fix is provided, to drop the exemplar-bearing metrics so that Prometheus can scrape the other metrics from the target?
We are using self-deployed collection (`v2.41.0-gmp.9-gke.0`) and have tried dropping the metrics via `metricRelabelings` and/or `relabelings` in `ServiceMonitor`, but it's not helping. Any suggestions?
```yaml
metricRelabelings: # happens after the scrape
- action: drop # drop exemplar metrics not supported by GCP Prometheus
  regex: 'spring_kafka_(template|listener)_seconds_count|hikaricp_connections_(acquire|usage)_seconds_count'
  sourceLabels: [ __name__ ]
relabelings: # happens before the scrape
- action: drop # drop exemplar metrics not supported by GCP Prometheus
  regex: 'spring_kafka_(template|listener)_seconds_count|hikaricp_connections_(acquire|usage)_seconds_count'
  sourceLabels: [ __name__ ]
```
@GitKaran hello! Prometheus `v2.43.1-gmp.0-gke.0` and `v2.45.3-gmp.0-gke.0` images are publicly available (although not publicly released outside of the GKE Rapid release channel yet). Can you try one of those images without the workaround?
Thanks @TheSpiritXIII, I deployed `v2.43.1-gmp.0-gke.0` and it works.
This should be resolved now on all the latest patches of all supported GKE minor versions. Please let us know if you have further issues. Thanks!
According to this page: https://cloud.google.com/stackdriver/docs/managed-prometheus/exemplars there is no support for counter exemplars: "Exemplars attached to counter metrics can't be ingested".
The problem is that we are using Spring Boot, and it has added exemplars to other metric types: https://github.com/spring-projects/spring-boot/wiki/Spring-Boot-3.2-Release-Notes#broader-exemplar-support-in-micrometer-112, as this is available for all supported Prometheus versions.
This leads to us not being able to scrape any metrics when using managed Prometheus. Port-forwarding to the collector pods and checking the GUI shows errors like this:
Is there any timeline for when managed Prometheus will update to Prometheus v2.43 or later, and start supporting exemplars for types other than histograms?
This is currently preventing us from fully migrating away from our own Prometheus to the managed one.
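For reference, the broader exemplar support linked above means the scrape output can contain an exemplar on a non-bucket sample, such as a timer's `_count`. The lines below are illustrative (names and values made up); it is exactly this shape of line that older Prometheus parsers rejected, failing the entire scrape rather than ignoring the exemplar:

```text
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{uri="/api"} 144.0 # {trace_id="abc123"} 1.0
http_server_requests_seconds_sum{uri="/api"} 53.2
```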