cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Periodic consistency check errors during queries with tenant federation enabled #5365

Open blovett opened 1 year ago

blovett commented 1 year ago

Describe the bug

We periodically see "consistency check failed" errors when running queries with tenant federation enabled. The blocks named in the errors do not belong to the tenant that holds the data.

When this happens, we get messages like this in the query frontend logs:

[pod/v1-cortex-query-frontend-cfc6c9d69-t8w9j/query-frontend] level=debug ts=2023-05-26T17:46:54.654382213Z caller=results_cache.go:374 traceID=5a06f1b765aa8748 msg="handle miss" start=1683676800000 spanID=692181cda7283b7f
[pod/v1-cortex-query-frontend-cfc6c9d69-t8w9j/query-frontend] level=error ts=2023-05-26T17:46:54.759380243Z caller=retry.go:79 traceID=5a06f1b765aa8748 msg="error processing request" try=0 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: error querying tenant_id fake: consistency check failed because some blocks were not queried: 01H02N308MZ51VVG77WAZXWHRR\"}"
[pod/v1-cortex-query-frontend-cfc6c9d69-t8w9j/query-frontend] level=error ts=2023-05-26T17:46:54.901137753Z caller=retry.go:79 traceID=5a06f1b765aa8748 msg="error processing request" try=1 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: error querying tenant_id fake: consistency check failed because some blocks were not queried: 01H02N308MZ51VVG77WAZXWHRR\"}"
[pod/v1-cortex-query-frontend-cfc6c9d69-t8w9j/query-frontend] level=error ts=2023-05-26T17:46:54.954194856Z caller=retry.go:79 traceID=5a06f1b765aa8748 msg="error processing request" try=2 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: error querying tenant_id fake: consistency check failed because some blocks were not queried: 01H02N308MZ51VVG77WAZXWHRR\"}"
[pod/v1-cortex-query-frontend-cfc6c9d69-t8w9j/query-frontend] level=error ts=2023-05-26T17:46:55.018502766Z caller=retry.go:79 traceID=5a06f1b765aa8748 msg="error processing request" try=3 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: error querying tenant_id fake: consistency check failed because some blocks were not queried: 01H02N308MZ51VVG77WAZXWHRR\"}"
[pod/v1-cortex-query-frontend-cfc6c9d69-t8w9j/query-frontend] level=error ts=2023-05-26T17:46:55.401956233Z caller=retry.go:79 traceID=5a06f1b765aa8748 msg="error processing request" try=4 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: error querying tenant_id fake: consistency check failed because some blocks were not queried: 01H02N308MZ51VVG77WAZXWHRR\"}"
[pod/v1-cortex-query-frontend-cfc6c9d69-t8w9j/query-frontend] level=warn ts=2023-05-26T17:46:55.402152993Z caller=logging.go:86 traceID=5a06f1b765aa8748 msg="GET /prometheus/api/v1/query_range?query=customer:rts_BWbits:sum&start=1683676800&end=1683763200&step=300 (500) 747.968778ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"errorType\\\":\\\"internal\\\",\\\"error\\\":\\\"expanding series: error querying tenant_id fake: consistency check failed because some blocks were not queried: 01H02N308MZ51VVG77WAZXWHRR\\\"}\" ws: false; Accept: */*; Connection: close; User-Agent: curl/7.88.1; X-Scope-Orgid: rts|fake; "

A successful run of the same query looks like this:

[pod/v1-cortex-query-frontend-cfc6c9d69-wh2ph/query-frontend] level=debug ts=2023-05-26T17:48:52.834813942Z caller=results_cache.go:374 org_id=rts traceID=20a4ac98a88c3d61 msg="handle miss" start=1683676800000 spanID=53ca2a15ef6f8322
[pod/v1-cortex-query-frontend-cfc6c9d69-wh2ph/query-frontend] level=debug ts=2023-05-26T17:48:52.939481535Z caller=logging.go:76 traceID=20a4ac98a88c3d61 msg="GET /prometheus/api/v1/query_range?query=customer:rts_BWbits:sum&start=1683676800&end=1683763200&step=300 (200) 105.020042ms"

To Reproduce

Steps to reproduce the behavior:

  1. Start Cortex 1.14.1 with tenant federation enabled
  2. Perform a federated query across multiple tenants (in our case, X-Scope-OrgID: rts|fake)
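
For reference, here is a minimal sketch of the federated request from step 2, built in Python. The path, query parameters, and tenant IDs are taken from the failing request in the logs above; the base URL `http://localhost:8080` and the helper name `federated_query_request` are assumptions for illustration only. Because the failure is intermittent, the request may need to be repeated before it shows up.

```python
from urllib.parse import urlencode


def federated_query_request(base_url, tenants, query, start, end, step):
    """Build the URL and headers for a federated query_range request.

    Tenant federation takes several tenant IDs in one X-Scope-OrgID
    header, separated by "|".
    """
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    url = f"{base_url}/prometheus/api/v1/query_range?{params}"
    headers = {"X-Scope-OrgID": "|".join(tenants), "Accept": "*/*"}
    return url, headers


# Values from the failing request in the logs; base_url is an assumption.
url, headers = federated_query_request(
    "http://localhost:8080",
    ["rts", "fake"],
    "customer:rts_BWbits:sum",
    1683676800,
    1683763200,
    300,
)
```

The pair can be sent with any HTTP client, for example `urllib.request.urlopen(urllib.request.Request(url, headers=headers))`.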

Expected behavior

I'd expect the federated query to succeed every time rather than fail intermittently with a consistency check error.

Environment:

Additional Context

Store-gateway logs: https://gist.github.com/blovett/84b08f2608f3cccf2cf4865c485720db I also included query-frontend logs above. If there are other logs I can provide to help troubleshoot this, please let me know.
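
In case it helps with triage, below is a rough sketch for pulling the tenant ID and block IDs out of the query-frontend error lines, and for reading the creation time embedded in a block ID. It assumes the error text has exactly the shape shown in the logs above, and it relies on block IDs being ULIDs, whose first 10 characters encode a millisecond Unix timestamp in Crockford base32. The function names are hypothetical and not part of Cortex.

```python
import re
from datetime import datetime, timedelta, timezone

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # ULID alphabet (no I, L, O, U)
MARKER = "some blocks were not queried:"
TENANT_RE = re.compile(r"error querying tenant_id ([^:\s]+):")
ULID_RE = re.compile(r"\b[0-9A-HJKMNP-TV-Z]{26}\b")


def parse_failure(line):
    """Return (tenant_id, [block_ids]), or None if the line is not a
    consistency check failure."""
    idx = line.find(MARKER)
    if idx < 0:
        return None
    tenant = TENANT_RE.search(line)
    blocks = ULID_RE.findall(line[idx + len(MARKER):])
    return (tenant.group(1) if tenant else None, blocks)


def ulid_millis(block_id):
    """Decode the millisecond timestamp from the first 10 ULID characters."""
    ms = 0
    for ch in block_id[:10]:
        ms = ms * 32 + CROCKFORD.index(ch)  # raises ValueError on a bad character
    return ms


def ulid_time(block_id):
    """Return the ULID timestamp as a timezone-aware UTC datetime."""
    ms = ulid_millis(block_id)
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc) + timedelta(
        milliseconds=ms % 1000
    )


# Shortened copy of an error line from the logs above.
sample = (
    r'msg="error processing request" try=0 err="rpc error: code = Code(500) '
    r'desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":'
    r'\"expanding series: error querying tenant_id fake: consistency check '
    r'failed because some blocks were not queried: 01H02N308MZ51VVG77WAZXWHRR\"}"'
)
tenant_id, block_ids = parse_failure(sample)
created = [ulid_time(b) for b in block_ids]
```

Comparing the decoded time with the query's `start` and `end` parameters shows whether the reported block was created inside the queried range.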

blovett commented 1 year ago

We continue to see this issue intermittently. Is there any additional information we could provide?