Closed: anuragkdi closed this issue 6 months ago.
@friedrichg @alanprot @alvinlin123 @gramidt could you guys provide some insights on the issue?
```
# Comma separated list of store-gateway addresses in DNS Service Discovery
# format. This option should be set when using the blocks storage and the
# store-gateway sharding is disabled (when enabled, the store-gateway instances
# form a ring and addresses are picked from the ring).
# CLI flag: -querier.store-gateway-addresses
[store_gateway_addresses: <string> | default = ""]
```
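For context, with sharding enabled the address list can stay at its default. A minimal sketch of the two relevant config sections (key names follow the Cortex YAML reference; the memberlist KV store is an illustrative assumption):

```yaml
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      store: memberlist

querier:
  # Leave empty: with sharding enabled, queriers discover
  # store-gateways through the ring rather than this list.
  store_gateway_addresses: ""
```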
The default is empty, and you should leave it empty. Queriers should be able to find store-gateways because you have store-gateway sharding enabled.
This is probably something new. The error `inconsistent ring tokens information` was noted as one that should never happen unless there is a bug in the ring code.
Is this something that you can always reproduce?
Latest update: yesterday, for a brief time we were able to query > 12 hours (the number of store gateways was 1 at that time). We scaled up the store gateways by +1, after which, due to load (high CPU on the node), the previously healthy store gateway was also restarted and the resync began again.
This morning, both store gateways are stable and active in the ring. But I see the errors again where the querier is not able to get the blocks for > 12 hours. Below are the errors currently in the store gateway (`inconsistent ring tokens information`).
Is there a wait time before the store gateway picks up the blocks, or is there some fundamental issue with the store gateway in this case? Btw, the sharding strategy is default in our store gateway atm.
```
level=warn ts=2024-02-29T03:37:15.538243823Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01GWCVEZC0VBCE7BHMCXW1M20N err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538273037Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01H42HHKP7EEK860YY9PFZC86J err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538301611Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01H61M8Y8NHW1CDJR067KP67DK err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538317781Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01GN7WZ0WAFMWTZX5027XTTAHZ err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538334842Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01GPEA1KVG7GTJY2PT3MRFV664 err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538346925Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01GRYJ38NB640SP5NBTPT8YTFC err="inconsistent ring tokens information"
level=warn ts=2024-02-29T03:37:15.538363306Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01H2NPETCNCPH61NY2EMK8QJD1 err="inconsistent ring tokens information"
level=info ts=2024-02-29T03:37:15.539052098Z caller=gateway.go:332 msg="successfully synchronized TSDB blocks for all users" reason=periodic
```
Btw, I changed store_gateway_addresses to the default as suggested.
Seems like the error `inconsistent ring tokens information` is defined as below in the code (as stated by @yeya24):
```go
// ErrInconsistentTokensInfo is the error returned if, due to an internal bug, the mapping between
// a token and its own instance is missing or unknown.
ErrInconsistentTokensInfo = errors.New("inconsistent ring tokens information")
```
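To illustrate the condition that comment describes, here is a simplified, hypothetical sketch (not the actual Cortex ring code): a token appears in the ring's sorted token list, but no instance claims ownership of it, so the lookup fails.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Sentinel mirroring the Cortex error message.
var ErrInconsistentTokensInfo = errors.New("inconsistent ring tokens information")

// findOwner is a simplified, hypothetical ring lookup: tokens is the
// sorted list of ring tokens, and owners maps each token back to the
// instance that registered it. If a token exists in the ring but has
// no known owner, the lookup fails with the same inconsistency the
// Cortex comment describes.
func findOwner(tokens []uint32, owners map[uint32]string, key uint32) (string, error) {
	if len(tokens) == 0 {
		return "", errors.New("empty ring")
	}
	// Find the first token >= key, wrapping around the ring.
	i := sort.Search(len(tokens), func(i int) bool { return tokens[i] >= key })
	if i == len(tokens) {
		i = 0
	}
	owner, ok := owners[tokens[i]]
	if !ok {
		return "", ErrInconsistentTokensInfo
	}
	return owner, nil
}

func main() {
	tokens := []uint32{100, 200, 300}
	// Token 200 is in the ring but missing from the owners map.
	owners := map[uint32]string{100: "sg-0", 300: "sg-1"}
	if _, err := findOwner(tokens, owners, 150); err != nil {
		fmt.Println(err) // inconsistent ring tokens information
	}
}
```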
How to fix this?
Should I upgrade the Cortex version if that would help? It's currently at an older version, 1.11.0.
Idk if it would help, but you can give it a try.
So I have scaled down the store gateways to 1 for now, to bypass any bug in the ring propagation. It seems it is loading the blocks and I don't see the errors. I will let the store gateway finish loading the blocks and share the results here.
Okay, so below are the findings, which should make things clear:
1) I could see metrics > 12 hours when a single store gateway was loading the blocks.
2) Unfortunately, due to max CPU utilization, the store gateway restarted. Below are the restart logs (we see "not loading tokens from file, tokens file path is empty"):
```
level=info ts=2024-02-29T08:34:57.403415617Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2024-02-29T08:34:57.403468083Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2024-02-29T08:34:57.403498101Z caller=module_service.go:64 msg=initialising module=runtime-config
level=info ts=2024-02-29T08:34:57.404603714Z caller=module_service.go:64 msg=initialising module=store-gateway
level=info ts=2024-02-29T08:34:57.404845938Z caller=basic_lifecycler.go:251 msg="instance found in the ring" instance=cortex-store-gateway-0 ring=store-gateway state=ACTIVE tokens=512 registered_at="2024-02-28 11:58:15 +0000 UTC"
level=info ts=2024-02-29T08:34:57.404876496Z caller=basic_lifecycler_delegates.go:63 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2024-02-29T08:34:57.40503789Z caller=gateway.go:230 msg="waiting until store-gateway is JOINING in the ring"
level=info ts=2024-02-29T08:34:58.269324832Z caller=memberlist_client.go:497 msg="joined memberlist cluster" reached_nodes=43
level=info ts=2024-02-29T08:35:12.736728922Z caller=gateway.go:234 msg="store-gateway is JOINING in the ring"
level=info ts=2024-02-29T08:35:12.736775039Z caller=gateway.go:244 msg="waiting until store-gateway ring topology is stable" min_waiting=1m0s max_waiting=5m0s
```
3) When the store gateway comes alive after the restart, we see `inconsistent ring tokens information` in the logs.
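The "waiting until store-gateway ring topology is stable" log line corresponds to the ring stability wait. A hedged sketch of the settings behind the min_waiting=1m0s / max_waiting=5m0s values seen in the restart logs (key names assumed from the Cortex store-gateway configuration reference):

```yaml
store_gateway:
  sharding_ring:
    # How long the ring must remain unchanged before the
    # store-gateway starts syncing blocks (min), and the
    # hard cap on that wait (max).
    wait_stability_min_duration: 1m
    wait_stability_max_duration: 5m
```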
Actions and suggestions:
1) I have now set `store-gateway.sharding-ring.tokens-file-path: /data/tokens`
2) Will upgrade AKS to configure new nodes with higher spec
3) Does the new Cortex version avoid a full resync in the store gateway when it restarts, according to this fix: https://github.com/cortexproject/cortex/pull/5363?
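For reference, a sketch of the tokens-file setting in YAML form (this assumes `/data` is a persistent volume that survives pod restarts; on ephemeral storage the file is lost and new tokens are generated each time):

```yaml
store_gateway:
  sharding_ring:
    # Persist this instance's ring tokens so a restarted pod
    # rejoins the ring with the same tokens instead of
    # regenerating them.
    tokens_file_path: /data/tokens
```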
After setting the tokens file path, the store gateway (after restarts due to high CPU) is still showing me the error `inconsistent ring tokens information`:
```
level=warn ts=2024-02-29T11:14:09.916716249Z caller=sharding_strategy.go:140 msg="failed to check block owner and block has been excluded because was not previously loaded" block=01HJNWQ1EGCHYJ6VKZQWHZ44KT err="inconsistent ring tokens information"
level=info ts=2024-02-29T11:15:20.478908396Z caller=bucket_stores.go:151 msg="successfully synchronized TSDB blocks for all users"
level=info ts=2024-02-29T11:15:20.479090568Z caller=gateway.go:271 msg="waiting until store-gateway is ACTIVE in the ring"
level=info ts=2024-02-29T11:15:20.598497927Z caller=gateway.go:275 msg="store-gateway is ACTIVE in the ring"
level=info ts=2024-02-29T11:15:20.59856337Z caller=cortex.go:436 msg="Cortex started"
```
Btw, got it resolved by scheduling the store gateway pods on a higher-spec node in AKS and allowing the resync to complete successfully (it restarted a few times but continued the resync) for the 1st store gateway pod, then gradually scaled up the store gateways thereafter.
Hi Folks,
We are seeing the below errors in Cortex while querying metrics > 12 hours in Grafana.
QUERIER
```
expanding series: consistency check failed because some blocks were not queried: 01HQQ110QJNE4392CFZWM69K6Z
method=blocksStoreQuerier.selectSorted level=warn msg="unable to get store-gateway clients while retrying to fetch missing blocks" err="no store-gateway instance left after checking exclude for block 01HQQ110QJNE4392CFZWM69K6Z"
```
STORE GATEWAY
meta.json for the block (the source seems to be "compactor"): 01HQQ110QJNE4392CFZWM69K6Z.json
CORTEX YAML
CORTEX RESOURCES:
cortex app version: 1.11.0
chart version: 1.3.0
Distributor (count): 15
Ingester (count): 25
Querier (count): 2
Query Frontend (count): 2
Compactor (count): 2
Store Gateway (count): 2
RF: 3
We are currently facing this in our production cluster (Day 2) and would appreciate it if someone could assist with this.