cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/

expanding series: consistency check failed because some blocks were not queried #5791

Closed anuragkdi closed 6 months ago

anuragkdi commented 6 months ago

Hi Folks,

We are seeing the errors below in Cortex when querying metrics over ranges longer than 12 hours in Grafana.

QUERIER

expanding series: consistency check failed because some blocks were not queried: 01HQQ110QJNE4392CFZWM69K6Z

method=blocksStoreQuerier.selectSorted level=warn msg="unable to get store-gateway clients while retrying to fetch missing blocks" err="no store-gateway instance left after checking exclude for block 01HQQ110QJNE4392CFZWM69K6Z"

STORE GATEWAY

2024-02-28T12:10:12+05:30 level=warn ts=2024-02-28T06:40:12.060087106Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01HC9TMZAGJ28RXW73QKD43VNK err="inconsistent ring tokens information"
2024-02-28T12:10:12+05:30 level=warn ts=2024-02-28T06:40:12.060120669Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01HD3SHHDV98NTCAXAPWPCW3RD err="inconsistent ring tokens information"
2024-02-28T12:10:12+05:30 level=warn ts=2024-02-28T06:40:12.060136819Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01GPQ3J10WK2X2QAQJJAT6N4Y2 err="inconsistent ring tokens information"
2024-02-28T12:10:12+05:30 level=info ts=2024-02-28T06:40:12.060608181Z caller=gateway.go:332 msg="successfully synchronized TSDB blocks for all users" reason=periodic

META.JSON for the block (the source seems to be "compactor"): 01HQQ110QJNE4392CFZWM69K6Z.json

CORTEX YAML

alertmanager:
  enable_api: false
  external_url: /api/prom/alertmanager
  storage: {}
api:
  prometheus_http_prefix: /prometheus
  response_compression_enabled: true
auth_enabled: true
blocks_storage:
  azure:
    account_key: 
    account_name: 
    container_name: 
  backend: azure
  bucket_store:
    bucket_index:
      enabled: true
    sync_dir: /data/tsdb-sync
  tsdb:
    dir: /data/tsdb
    retention_period: 15h0m0s
distributor:
  pool:
    health_check_ingesters: true
  shard_by_all_labels: true
frontend:
  log_queries_longer_than: 10s
ingester:
  lifecycler:
    final_sleep: 30s
    join_after: 0s
    num_tokens: 512
    observe_period: 0s
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 10485760
    max_send_msg_size: 10485760
limits:
  enforce_metric_name: false
  ingestion_rate: 100000
  max_global_series_per_metric: 2000000
  max_label_names_per_series: 150
  max_label_value_length: 4096
  max_query_lookback: 0s
  max_series_per_metric: 0
  reject_old_samples: true
  reject_old_samples_max_age: 168h
memberlist:
  bind_port: 7946
  join_members:
  - cortex-memberlist
querier:
  active_query_tracker_dir: /data/cortex/querier
  store_gateway_addresses: cortex-store-gateway-headless:9095
  query_ingesters_within: 13h
  query_store_after: 12h
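  # With query_store_after at 12h, data older than 12h is read from the store-gateways
  # (ingesters only cover up to query_ingesters_within, 13h), so only queries spanning
  # more than 12 hours depend on the store-gateway blocks.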
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      memcached:
        expiration: 1h
      memcached_client:
        timeout: 1s
  split_queries_by_interval: 24h
ruler:
  enable_alertmanager_discovery: false
  enable_api: false
  storage: {}
runtime_config:
  file: /etc/cortex-runtime-config/runtime_config.yaml
server:
  grpc_listen_port: 9095
  grpc_server_max_concurrent_streams: 10000
  grpc_server_max_recv_msg_size: 10485760
  grpc_server_max_send_msg_size: 10485760
  http_listen_port: 8080
storage:
  engine: blocks
  index_queries_cache_config:
    memcached:
      expiration: 1h
    memcached_client:
      timeout: 1s
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      store: memberlist
compactor:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      store: memberlist
tenant_federation:
  enabled: true

CORTEX RESOURCES:

Cortex app version: 1.11.0
Chart version: 1.3.0
Distributor (count): 15
Ingester (count): 25
Querier (count): 2
Query Frontend (count): 2
Compactor (count): 2
Store Gateway (count): 2
Replication factor (RF): 3

We are currently facing this in our production cluster (Day 2) and would appreciate it if someone could assist with this.

anuragkdi commented 6 months ago

@friedrichg @alanprot @alvinlin123 @gramidt could you guys provide some insights on the issue?

friedrichg commented 6 months ago
# Comma separated list of store-gateway addresses in DNS Service Discovery
# format. This option should be set when using the blocks storage and the
# store-gateway sharding is disabled (when enabled, the store-gateway instances
# form a ring and addresses are picked from the ring).
# CLI flag: -querier.store-gateway-addresses
[store_gateway_addresses: <string> | default = ""]

The default is empty; you should leave it empty. Queriers should be able to find store-gateways because you have store-gateway sharding enabled.
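
For reference, a minimal sketch of the querier block from your YAML with the address list dropped (assuming store-gateway sharding stays enabled, so queriers discover store-gateways through the ring):

querier:
  active_query_tracker_dir: /data/cortex/querier
  # store_gateway_addresses intentionally left unset: with sharding enabled,
  # the store-gateway instances form a ring and addresses are picked from it.
  query_ingesters_within: 13h
  query_store_after: 12h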

yeya24 commented 6 months ago

"inconsistent ring tokens information": this is probably something new. This error is documented as one that should never happen unless there is a bug in the ring code.

Is this something that you can always reproduce?

anuragkdi commented 6 months ago

Latest update: yesterday, for a brief time we were able to query > 12 hours (the store gateway count was 1 at that time). We then scaled the store gateways up by one, after which, due to load (high CPU on the node), the previously healthy store gateway was also restarted and the resync began again.

This morning, both store gateways are stable and active in the ring, but I see the errors again where the querier is not able to get the blocks for > 12 hours. Below are the errors currently in the store gateway ("inconsistent ring tokens information").

Is there a wait time before the store gateway picks up the blocks, or is there some fundamental issue with the store gateway in this case? By the way, the sharding strategy in our store gateways is currently the default.

level=warn ts=2024-02-29T03:37:15.538243823Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01GWCVEZC0VBCE7BHMCXW1M20N err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538273037Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01H42HHKP7EEK860YY9PFZC86J err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538301611Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01H61M8Y8NHW1CDJR067KP67DK err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538317781Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01GN7WZ0WAFMWTZX5027XTTAHZ err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538334842Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01GPEA1KVG7GTJY2PT3MRFV664 err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538346925Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01GRYJ38NB640SP5NBTPT8YTFC err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538363306Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01H2NPETCNCPH61NY2EMK8QJD1 err="inconsistent ring tokens information"
level=info ts=2024-02-29T03:37:15.539052098Z caller=gateway.go:332 msg="successfully synchronized TSDB blocks for all users" reason=periodic

anuragkdi commented 6 months ago

By the way, I changed store_gateway_addresses back to the default (empty) as suggested.

anuragkdi commented 6 months ago

It seems the "inconsistent ring tokens information" error is defined as below in the code (as noted by @yeya24):

// ErrInconsistentTokensInfo is the error returned if, due to an internal bug, the mapping between
// a token and its own instance is missing or unknown.
ErrInconsistentTokensInfo = errors.New("inconsistent ring tokens information")

How to fix this?

anuragkdi commented 6 months ago

Should I upgrade the Cortex version if that would help? It's currently on an older version, 1.11.0.

yeya24 commented 6 months ago

I don't know if it would help, but you can give it a try.

anuragkdi commented 6 months ago

So I have scaled the store gateways down to 1 for now, to sidestep any bug in the ring propagation. It seems to be loading the blocks and I don't see the errors. I will let the store gateway finish loading the blocks and share the results here.

anuragkdi commented 6 months ago

OK, so below are the findings, which should make things clearer:

1) I could see metrics > 12 hours while a single store gateway was loading the blocks.

2) Unfortunately, due to maxed-out CPU utilization, the store gateway restarted. Below are the restart logs (we see "not loading tokens from file, tokens file path is empty"):

level=info ts=2024-02-29T08:34:57.403415617Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2024-02-29T08:34:57.403468083Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2024-02-29T08:34:57.403498101Z caller=module_service.go:64 msg=initialising module=runtime-config
level=info ts=2024-02-29T08:34:57.404603714Z caller=module_service.go:64 msg=initialising module=store-gateway
level=info ts=2024-02-29T08:34:57.404845938Z caller=basic_lifecycler.go:251 msg="instance found in the ring" instance=cortex-store-gateway-0 ring=store-gateway state=ACTIVE tokens=512 registered_at="2024-02-28 11:58:15 +0000 UTC"
level=info ts=2024-02-29T08:34:57.404876496Z caller=basic_lifecycler_delegates.go:63 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2024-02-29T08:34:57.40503789Z caller=gateway.go:230 msg="waiting until store-gateway is JOINING in the ring"
level=info ts=2024-02-29T08:34:58.269324832Z caller=memberlist_client.go:497 msg="joined memberlist cluster" reached_nodes=43
level=info ts=2024-02-29T08:35:12.736728922Z caller=gateway.go:234 msg="store-gateway is JOINING in the ring"
level=info ts=2024-02-29T08:35:12.736775039Z caller=gateway.go:244 msg="waiting until store-gateway ring topology is stable" min_waiting=1m0s max_waiting=5m0s

3) When the store gateway comes back up after the restart, we see "inconsistent ring tokens information" in the logs.

Actions and suggestions:

1) I have now set 'store-gateway.sharding-ring.tokens-file-path': '/data/tokens' (see the sketch below).

2) I will upgrade AKS to provision new nodes with a higher spec.

3) Does a newer Cortex version avoid the store-gateway resync on restart, per this fix: https://github.com/cortexproject/cortex/pull/5363?
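
For item 1, a minimal sketch of what the store_gateway block would look like with the tokens file persisted. This assumes the /data path is backed by a volume that survives pod restarts; the key names follow the Cortex store-gateway configuration reference, and the two wait_stability settings are shown at their defaults (the same 1m/5m values visible in the restart logs above):

store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      store: memberlist
    # Persist the ring tokens so a restarted store-gateway rejoins with the same
    # tokens instead of generating new ones.
    tokens_file_path: /data/tokens
    # Defaults shown: how long to wait for the ring topology to stabilize at startup.
    wait_stability_min_duration: 1m
    wait_stability_max_duration: 5m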

anuragkdi commented 6 months ago

After setting the tokens file path, the store gateway (after restarts due to high CPU) is still showing me the "inconsistent ring tokens information" error:

level=warn ts=2024-02-29T11:14:09.916716249Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01HJNWQ1EGCHYJ6VKZQWHZ44KT err="inconsistent ring tokens information"
level=info ts=2024-02-29T11:15:20.478908396Z caller=bucket_stores.go:151 msg="successfully synchronized TSDB blocks for all users"
level=info ts=2024-02-29T11:15:20.479090568Z caller=gateway.go:271 msg="waiting until store-gateway is ACTIVE in the ring"
level=info ts=2024-02-29T11:15:20.598497927Z caller=gateway.go:275 msg="store-gateway is ACTIVE in the ring"
level=info ts=2024-02-29T11:15:20.59856337Z caller=cortex.go:436 msg="Cortex started"

anuragkdi commented 6 months ago

By the way, I got this resolved by scheduling the store gateway pods on a higher-spec node in AKS and allowing the resync to complete for the first store gateway pod (it restarted a few times but continued the resync), then gradually scaling the store gateways back up thereafter.