GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

Memory error on querying data for range of 2+ days #767

Closed by maksym-iv 10 months ago

maksym-iv commented 10 months ago

Hello, I'm fairly new to GCP Managed Prometheus, but I have worked with self-managed Prometheus for a while. Currently we ingest all metrics from 4 projects into a single one (aggregating the metrics). We have a prom-frontend set up in GKE alongside Grafana. Recently I noticed a strange error coming from both Grafana and prom-frontend (I did a kubectl port-forward and queried the Prometheus UI directly to take Grafana out of the equation) when running a pretty trivial query over a period of 2 days:

kube_deployment_status_replicas_ready{env_group="stage", cluster="gke-main-a", deployment="foo"}

Errors noticed:

There is no crazy cardinality for the metric (31 in GCP UI).

The same query in the UI succeeds, but takes 10+ seconds to show the results, which also seems a bit strange.

I believe I'm missing something simple, but I can't figure out what exactly. I would appreciate any advice on solving this issue.
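For anyone trying to reproduce this outside Grafana, here is a minimal sketch of building the equivalent range-query request against a port-forwarded frontend. It uses the standard Prometheus HTTP API path `/api/v1/query_range`; the address `localhost:9090` and the 300-second step are assumptions, not values taken from this report.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumption: the frontend is port-forwarded to this local address (hypothetical).
FRONTEND = "http://localhost:9090"

# The query from this report, unchanged.
QUERY = (
    'kube_deployment_status_replicas_ready'
    '{env_group="stage", cluster="gke-main-a", deployment="foo"}'
)


def range_query_url(query, end, window, step_seconds):
    """Build a /api/v1/query_range URL covering `window` up to `end`."""
    start = end - window
    params = {
        "query": query,
        "start": f"{start.timestamp():.0f}",  # Unix seconds
        "end": f"{end.timestamp():.0f}",
        "step": str(step_seconds),  # resolution in seconds
    }
    return f"{FRONTEND}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Two-day window, matching the period that triggered the error.
    print(range_query_url(QUERY, now, timedelta(days=2), step_seconds=300))
```

The printed URL can be fetched with any HTTP client, which makes it easy to compare response times for different window lengths.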

lyanco commented 10 months ago

Heya - we are currently experiencing a partial degradation in querying, so this is likely related to that.

The issue mostly affects queries over data older than 25 hours, but there are knock-on effects that are causing slightly worse query performance for data fresher than 25 hours as well.

Will keep you updated.
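To check whether a slow or failing query touches the mostly affected data, one option is to split the query range at the 25-hour mark and run the two halves separately. The sketch below only does the time arithmetic; the helper name `split_at_cutoff` is hypothetical, and this is a diagnostic aid under that assumption, not a fix for the degradation.

```python
from datetime import datetime, timedelta, timezone

# From the comment above: data older than 25 hours is the mostly affected part.
AFFECTED_AGE = timedelta(hours=25)


def split_at_cutoff(start, end, now):
    """Split [start, end] into (older, fresher) sub-ranges around now - 25h.

    Either element is None when the range does not reach that side.
    """
    cutoff = now - AFFECTED_AGE
    if end <= cutoff:
        return (start, end), None   # entirely older than 25 hours
    if start >= cutoff:
        return None, (start, end)   # entirely fresher than 25 hours
    return (start, cutoff), (cutoff, end)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    older, fresher = split_at_cutoff(now - timedelta(days=2), now, now)
    print("older than 25h:", older)
    print("fresher than 25h:", fresher)
```

If only the older sub-range fails or is slow, that is consistent with the degradation described here.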

lyanco commented 10 months ago

You can follow along here: https://status.cloud.google.com/incidents/ZvBMWa5Z8yhfCwbp5xTp#2c2sBHWU84yPDJ8y1ar4

maksym-iv commented 10 months ago

Closing since it's a GCP issue.

bwplotka commented 10 months ago

Also it should be good now: https://status.cloud.google.com/incidents/ZvBMWa5Z8yhfCwbp5xTp#2c2sBHWU84yPDJ8y1ar4