Query fails past certain point in time: expanding series: not found

jakubgs commented 3 years ago

Issue

When I make a query for 21 days for a metric I get back a result without issues:

 > curl -sv 'http://localhost:9092/prometheus/api/v1/query_range?query=some_metric&start=1604583000&end=1606397400&step=900' \
      | jq '.data.result | length'
8

But when I increase the query timerange to 22 days it fails horribly with a 500 error:

 > curl -s 'http://localhost:9092/prometheus/api/v1/query_range?query=some_metric&start=1604496600&end=1606397400&step=900'
{"status":"error","errorType":"internal","error":"expanding series: not found"}
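
For anyone checking the numbers, the two windows above can be verified with plain shell arithmetic. This is only a small sketch; the helper names `range_days` and `point_count` are made up for illustration and are not part of Cortex or Prometheus.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Cortex): sanity-check query_range windows.

# Whole days between two Unix timestamps: range_days START END
range_days() {
    echo $(( ($2 - $1) / 86400 ))
}

# Evaluation points per series for a range query, both endpoints included:
# point_count START END STEP
point_count() {
    echo $(( ($2 - $1) / $3 + 1 ))
}

range_days 1604583000 1606397400        # window of the query that works
range_days 1604496600 1606397400        # window of the query that fails
point_count 1604496600 1606397400 900   # points per series in the failing query
```

The two start times differ by exactly 86400 seconds, so the only change between the working and the failing request is one extra day at the beginning of the window.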

Logs

The query-frontend shows this in logs:

caller=retry.go:71 msg="error processing request" try=0 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: not found\"}"
caller=retry.go:71 msg="error processing request" try=1 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: not found\"}"
caller=retry.go:71 msg="error processing request" try=2 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: not found\"}"
caller=retry.go:71 msg="error processing request" try=3 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: not found\"}"
caller=retry.go:71 msg="error processing request" try=4 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: not found\"}"

Debug log level doesn't show anything more than that.

The Cortex instances running with target all do not print any errors or warnings when the query fails.

Setup

Cortex: 1.5.0, binary
Cassandra: 3.11.9, binary
Storage: Chunks

Configuration

Here are example configurations of my nodes:

cortex running all - config
cortex running query-frontend - config

Questions

What does expanding series: not found mean?
What metrics should I look at to debug this?
How can I debug this further?
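
One possible way to narrow this down is to bisect the query start time between the value that works and the value that fails, which shows exactly where in the stored data the error begins. The sketch below reuses the URL, metric, end time and step from the commands in this report; the function names `query_ok` and `bisect_start` are hypothetical, and it assumes that every start earlier than some cutoff fails while every later one succeeds.

```shell
#!/bin/sh
# Sketch only: bisect the start timestamp of a query_range request.
# URL and parameters come from the report; function names are hypothetical.

BASE_URL='http://localhost:9092/prometheus/api/v1/query_range'
QUERY='some_metric'
END=1606397400
STEP=900

# Succeeds (exit 0) if the query starting at $1 returns status "success".
query_ok() {
    curl -s "${BASE_URL}?query=${QUERY}&start=$1&end=${END}&step=${STEP}" \
        | jq -e '.status == "success"' > /dev/null
}

# bisect_start GOOD_START BAD_START
# GOOD_START is a (later) start that works, BAD_START an (earlier) one that fails.
# Halves the gap until the two are at most one step apart.
bisect_start() {
    good=$1
    bad=$2
    while [ $(( good - bad )) -gt "$STEP" ]; do
        mid=$(( bad + (good - bad) / 2 ))
        if query_ok "$mid"; then
            good=$mid
        else
            bad=$mid
        fi
    done
    echo "working start: $good, failing start: $bad"
}
```

With the timestamps from this report it would be called as `bisect_start 1604583000 1604496600`. The resulting boundary can then be compared against whatever happened to the storage around that time.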

jakubgs commented 3 years ago

And I'm pretty sure it's not lack of resources because the hosts are under-utilized if we look at CPU/RAM for Cortex nodes:

[Image: cortex_low_cpu_usage]

[Image: cortex_mem_cpu_usage]

Under 15% CPU utilization and ~11GB memory free most of the time. So I'm pretty sure it's not the resources. Same can be said for Cassandra, which is even lower:

[Image: cassandra_low_cpu_usage]

So it's clearly something about my configuration that is under-utilizing the hardware available and causing these longer queries to fail.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

Gaozizhong commented 1 year ago

Has anyone solved this problem?

jakubgs commented 1 year ago

Yes. I solved it by taking our Apache Cassandra cluster behind the shed and putting it out of its misery.

The S3 backend works much better.

cortexproject / cortex