Memory error on querying data for range of 2+ days

maksym-iv commented 10 months ago

Hello, I'm fairly new to GCP Managed Prometheus, though I have worked with self-managed Prometheus for a while. Currently we ingest all metrics from 4 projects into a single one (aggregating the metrics). We have prom-frontend set up in GKE alongside Grafana. Recently I noticed a strange error coming from both Grafana and prom-frontend (I did a kubectl port-forward and queried the Prometheus UI directly to exclude Grafana from the equation) when running a pretty trivial query over a period of 2 days:

kube_deployment_status_replicas_ready{env_group="stage", cluster="gke-main-a", deployment="foo"}

Errors noticed:

In both prom-frontend and Grafana

Error executing query: expanding series: generic::aborted: User /UNSPECIFIED:cloud-monitoring-query/UNSPECIFIED:gcm-api/CONSUMER_RESOURCE_CONTAINER:0 has requested 5853MiB of memory for processing queries on one Monarch node (limit 5847MiB), refusing to grant further memory for this query.

In Grafana

Status: 500. Message: internal: expanding series: generic::aborted: invalid status monarch::220: Cancelled due to the number of queries whose evaluation is blocked waiting for memory is 502, which is equal to or greater than the limit of 500.

The cardinality of the metric is not crazy (31 in the GCP UI).

The same query in the UI succeeds, but takes 10+ seconds to show the results, which also seems a bit strange.

I believe I'm missing something simple, but I can't figure out what exactly. I would appreciate any advice on solving this issue.
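
For anyone trying to reproduce this outside the UI, here is a minimal sketch of how the 2-day range query above could be expressed against the standard Prometheus HTTP API (`/api/v1/query_range`). The base URL `http://localhost:9090` and the 300-second step are assumptions (a local `kubectl port-forward` to prom-frontend), and `build_range_query_url` is a hypothetical helper name, not part of any library.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumption: prom-frontend is reachable locally through kubectl port-forward.
BASE_URL = "http://localhost:9090"

# The query from the report above, unchanged.
QUERY = (
    'kube_deployment_status_replicas_ready'
    '{env_group="stage", cluster="gke-main-a", deployment="foo"}'
)


def build_range_query_url(base_url, query, end, lookback, step_seconds):
    """Build a Prometheus query_range URL covering [end - lookback, end]."""
    start = end - lookback
    params = {
        "query": query,
        "start": f"{start.timestamp():.0f}",  # Unix seconds
        "end": f"{end.timestamp():.0f}",
        "step": str(step_seconds),  # resolution step in seconds (assumed value)
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Two days of data, matching the period that triggered the error.
    print(build_range_query_url(BASE_URL, QUERY, now, timedelta(days=2), 300))
```

The printed URL can be fetched with any HTTP client. Shrinking `lookback` is a quick way to see at which range the error starts to appear.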

lyanco commented 10 months ago

Heya - we are currently experiencing a partial degradation in querying, so this is likely related to that.

The issue mostly affects queries over data older than 25 hours, but there are knock-on effects that cause slightly worse query performance for data fresher than 25 hours as well.

Will keep you updated.
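
To check which part of a query range is affected, one option is to split the range at the 25-hour age boundary mentioned above and query the two windows separately. The sketch below only computes the windows. `split_by_age` is a hypothetical helper, and the 25-hour default is taken from this comment about the incident, not from any documented general limit.

```python
from datetime import datetime, timedelta, timezone


def split_by_age(start, end, now, boundary=timedelta(hours=25)):
    """Split [start, end] into (older, recent) windows around now - boundary.

    Either element of the returned pair is None when the range lies
    entirely on one side of the cutoff.
    """
    cutoff = now - boundary
    if end <= cutoff:
        return (start, end), None  # whole range is older than the boundary
    if start >= cutoff:
        return None, (start, end)  # whole range is fresher than the boundary
    return (start, cutoff), (cutoff, end)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    older, recent = split_by_age(now - timedelta(days=2), now, now)
    print("older window: ", older)
    print("recent window:", recent)
```

If only the older window fails or is slow, that is consistent with the degradation described here rather than with a problem in the query itself.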

lyanco commented 10 months ago

You can follow along here: https://status.cloud.google.com/incidents/ZvBMWa5Z8yhfCwbp5xTp#2c2sBHWU84yPDJ8y1ar4

maksym-iv commented 10 months ago

Closing since it's a GCP issue.

bwplotka commented 10 months ago

Also it should be good now: https://status.cloud.google.com/incidents/ZvBMWa5Z8yhfCwbp5xTp#2c2sBHWU84yPDJ8y1ar4

GoogleCloudPlatform / prometheus-engine #767