https://cortexmetrics.io/docs/blocks-storage/bucket-index/
Please take a look at the doc. The bucket index is created and updated periodically by the Compactor component, so you need to have it up and running. If you don't have the compactor, you can disable the bucket index in your read path.
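For reference, disabling the bucket index on the read path is a setting under blocks_storage.bucket_store; a minimal sketch, assuming the option names from the linked docs apply to your version:

blocks_storage:
  bucket_store:
    bucket_index:
      # When false, queriers and store-gateways scan the bucket directly
      # instead of relying on the compactor-maintained bucket-index.json.gz.
      enabled: false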
https://cortexmetrics.io/docs/blocks-storage/bucket-index/
Please take a look at the doc. The bucket index is created and updated periodically by the Compactor component, so you need to have it up and running. If you don't have the compactor, you can disable the bucket index in your read path.
Shouldn't the compactor be enabled by default?
I have read the relevant documents, but I can't find this covered. How do I confirm that the compactor is running? How do I configure and verify it? Can you help guide me?
my compactor config
compactor:
data_dir: /data/tmp_cortex_data/compactor
sharding_enabled: true
sharding_strategy: shuffle-sharding
sharding_ring:
kvstore:
store: consul
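One way to confirm that the compactor module is actually running is to check the admin pages exposed on the Cortex HTTP port; a rough sketch, assuming the default endpoints (host and port are placeholders):

# Lists all modules with their state; the compactor should be reported as Running.
curl -s http://<cortex-host>:<http-port>/services
# With compactor sharding enabled, this page shows the members of the compactor ring.
curl -s http://<cortex-host>:<http-port>/compactor/ring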
Log showing instance compaction completing successfully:
level=debug ts=2023-09-22T08:21:39.422467759Z caller=ruler.go:485 msg="syncing rules" reason=periodic
level=debug ts=2023-09-22T08:21:40.230728704Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=44.683µs msg="gRPC (success)"
level=debug ts=2023-09-22T08:21:52.917003211Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=41.855µs msg="gRPC (success)"
level=debug ts=2023-09-22T08:21:55.231174555Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=43.984µs msg="gRPC (success)"
level=info ts=2023-09-22T08:22:05.257402463Z caller=shipper.go:334 org_id=hdfs_fsimage msg="upload new block" id=01HAXZMTY275XNBEHSKWQVGKZM
level=debug ts=2023-09-22T08:22:05.258714514Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=celeborn_metrics uploaded=0
level=debug ts=2023-09-22T08:22:05.258718154Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=yarn_app_finish uploaded=0
level=debug ts=2023-09-22T08:22:05.25911267Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=kubesphere_metrics uploaded=0
level=debug ts=2023-09-22T08:22:05.259115908Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=fake uploaded=0
level=debug ts=2023-09-22T08:22:05.259111442Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=hdfs_testclsuter uploaded=0
level=debug ts=2023-09-22T08:22:05.259117919Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=yarn_app_run uploaded=0
level=debug ts=2023-09-22T08:22:05.260376917Z caller=objstore.go:288 org_id=hdfs_fsimage msg="uploaded file" from=/data/tmp_cortex_data/tsdb/hdfs_fsimage/thanos/upload/01HAXZMTY275XNBEHSKWQVGKZM/chunks/000001 dst=01HAXZMTY275XNBEHSKWQVGKZM/chunks/000001 bucket="tracing: fs: /data/cortex/data/tsdb"
level=debug ts=2023-09-22T08:22:05.267450004Z caller=objstore.go:288 org_id=hdfs_fsimage msg="uploaded file" from=/data/tmp_cortex_data/tsdb/hdfs_fsimage/thanos/upload/01HAXZMTY275XNBEHSKWQVGKZM/index dst=01HAXZMTY275XNBEHSKWQVGKZM/index bucket="tracing: fs: /data/cortex/data/tsdb"
level=debug ts=2023-09-22T08:22:05.268059327Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=hdfs_fsimage uploaded=1
level=debug ts=2023-09-22T08:22:05.602752313Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=hdfs_testclsuter compactReason=regular
level=debug ts=2023-09-22T08:22:05.602768281Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=fake compactReason=regular
level=debug ts=2023-09-22T08:22:05.602846142Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=yarn_app_run compactReason=regular
level=debug ts=2023-09-22T08:22:05.602850518Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=celeborn_metrics compactReason=regular
level=debug ts=2023-09-22T08:22:05.603029012Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=yarn_app_finish compactReason=regular
level=debug ts=2023-09-22T08:22:05.603215064Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=kubesphere_metrics compactReason=regular
level=debug ts=2023-09-22T08:22:07.918550378Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=32.503µs msg="gRPC (success)"
The file data shown in the tracing logs also looks normal.
If you are using Cortex all in one (target=all) then you should have your compactor running in the same process.
If you are using Cortex all in one (target=all) then you should have your compactor running in the same process.
After adding compactor to target in the config, the bucket index actually takes effect.
But why do queries on some nodes still always fail? What do we need to do about this?
level=debug ts=2023-09-25T09:18:46.702103245Z caller=grpc_logging.go:46 method=/cortex.Ingester/QueryExemplars duration=22.19µs msg="gRPC (success)"
level=error ts=2023-09-25T09:18:46.702509846Z caller=retry.go:79 org_id=fake msg="error processing request" try=3 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: bucket index is too old and the last time it was updated exceeds the allowed max staleness\"}"
ts=2023-09-25T09:18:46.70258749Z caller=spanlogger.go:87 org_id=fake method=QueryStream level=debug series=4 samples=160
level=debug ts=2023-09-25T09:18:46.702614752Z caller=grpc_logging.go:67 method=/cortex.Ingester/QueryStream duration=143.689µs msg="gRPC (success)"
ts=2023-09-25T09:18:46.702824041Z caller=spanlogger.go:87 org_id=fake method=querier.Select level=debug start="2023-09-25 09:08:45 +0000 UTC" end="2023-09-25 09:18:45 +0000 UTC" step=15000 matchers="unsupported value type"
ts=2023-09-25T09:18:46.703164549Z caller=spanlogger.go:87 org_id=fake method=QueryStream level=debug series=4 samples=160
level=debug ts=2023-09-25T09:18:46.703194485Z caller=grpc_logging.go:67 duration=131.71µs method=/cortex.Ingester/QueryStream msg="gRPC (success)"
ts=2023-09-25T09:18:46.70381203Z caller=spanlogger.go:87 org_id=fake method=QueryStream level=debug series=4 samples=160
level=debug ts=2023-09-25T09:18:46.70384042Z caller=grpc_logging.go:67 method=/cortex.Ingester/QueryStream duration=128.877µs msg="gRPC (success)"
level=error ts=2023-09-25T09:18:46.70383671Z caller=retry.go:79 org_id=fake msg="error processing request" try=4 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: bucket index is too old and the last time it was updated exceeds the allowed max staleness\"}"
level=warn ts=2023-09-25T09:18:46.70389064Z caller=logging.go:86 traceID=7c9a1faf23cffa73 msg="GET /prometheus/api/v1/query_range?query=doris_fe_connection_total&start=1695633225&end=1695633525&step=15 (500) 7.333013ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"errorType\\\":\\\"internal\\\",\\\"error\\\":\\\"expanding series: bucket index is too old and the last time it was updated exceeds the allowed max staleness\\\"}\" ws: false; Accept: application/json, text/plain, */*; Accept-Encoding: gzip, deflate, br; Accept-Language: zh-CN,zh;q=0.9; Sec-Ch-Ua: \".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"; Sec-Ch-Ua-Mobile: ?0; Sec-Ch-Ua-Platform: \"Windows\"; Sec-Fetch-Dest: empty; Sec-Fetch-Mode: cors; Sec-Fetch-Site: same-origin; User-Agent: Grafana/7.5.17; X-Dashboard-Id: 2; X-Forwarded-For: 183.6.36.48, 10.12.75.19, 10.12.75.19; X-Grafana-Org-Id: 1; X-Panel-Id: 11; X-Real-Ip: 183.6.36.48; X-Request-Id: 702f120e06a9609fdb354dd5a58b9f6c; X-Scheme: https; X-Scope-Orgid: fake; "
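The "bucket index is too old" error above is, as far as I can tell, governed by the allowed staleness on the read path, i.e. blocks_storage.bucket_store.bucket_index.max_stale_period in the Cortex config; a sketch, with the value shown only as an illustration:

blocks_storage:
  bucket_store:
    bucket_index:
      enabled: true
      # Queries fail with "bucket index is too old" once the index has not been
      # refreshed for longer than this period, so it must stay well above the
      # compactor's cleanup interval (15m by default).
      max_stale_period: 1h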
Only one node has a recent update time on bucket-index.json.gz; on the other nodes the file was generated a long time ago, even though the compactor/ring status looks fine. So why isn't the bucket-index.json.gz file being updated?
After changing store: consul to store: inmemory, the index went back to being updated normally. Is there anything I need to pay attention to when the compactor uses consul? Why is the bucket-index.json.gz file not updated properly with consul? Do I need to configure an independent directory (prefix) for each node? The data written to consul looks garbled; is that normal? The documentation is a poor guide for these configurations and often trips people up.
Compactor config after the update:
compactor:
data_dir: /data/tmp_cortex_data/compactor
sharding_enabled: true
sharding_strategy: shuffle-sharding
sharding_ring:
kvstore:
#store: consul
#prefix: compactor/
store: inmemory
instance_interface_names: [bond0]
consul kv get compactor/compactor
(binary output, mostly non-printable; the readable fragments include the instance name cortex-65-148.hiido.host.int.yy.com and the address 10.12.65.148:9095)
......
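The value stored under the compactor ring key is a binary-encoded ring descriptor, so garbled output from consul kv get is expected and not a sign of corruption; a more readable view, assuming the usual ring admin page is enabled (host and port are placeholders):

# The ring data in consul is a binary encoding and is not meant to be read directly;
# the compactor ring page renders the same membership information as a table.
curl -s http://<cortex-host>:<http-port>/compactor/ring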
@liangrui1988
Is there anything I need to pay attention to when the compactor uses consul? Why is the bucket-index.json.gz file not updated properly?
It should be fine. Can you check whether a bucket-index.json.gz file is generated in the bucket? From what I can see so far, the bucket index file is generated but it is not updated in time.
Can you check the value of the compactor.cleanup-interval flag? If you are using the default, it is 15 minutes, and ideally the bucket index should be updated at that interval. If it is not up to date, check whether your compactor is falling behind.
You can use the metric cortex_bucket_index_last_successful_update_timestamp_seconds to check it.
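A quick way to check that metric against the 15-minute cleanup interval is to scrape the compactor's /metrics endpoint (a sketch; host and port are placeholders):

# Unix timestamp of the last successful bucket-index update
curl -s http://<compactor-host>:<http-port>/metrics \
  | grep cortex_bucket_index_last_successful_update_timestamp_seconds
# Compare it against the current time; with the default 15m cleanup interval the
# lag should stay well under ~900 seconds.
date +%s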
@liangrui1988
Is there anything I need to pay attention to when the compactor uses consul? Why is the bucket-index.json.gz file not updated properly?
It should be fine. Can you check whether a bucket-index.json.gz file is generated in the bucket? From what I can see so far, the bucket index file is generated but it is not updated in time. Can you check the value of the compactor.cleanup-interval flag? If you are using the default, it is 15 minutes, and ideally the bucket index should be updated at that interval. If it is not up to date, check whether your compactor is falling behind. You can use the metric cortex_bucket_index_last_successful_update_timestamp_seconds to check it.
compactor.cleanup-interval uses the default value of 15 minutes.
I checked cortex_bucket_index_last_successful_update_timestamp_seconds and found that it is mostly not produced; only occasionally do a few updates appear, and they do not look like a complete index update.
The metric is not reported consistently every 15 minutes the way the configured interval would suggest.
I wonder if the Distributor is not running. I found that my status page says "Distributor is not running with global limits enabled". How do I enable this?
My config:
target: compactor,distributor,ingester,purger,querier,query-frontend,ruler,store-gateway
limits:
accept_ha_samples: true
compactor_blocks_retention_period: 2592005s
ingestion_tenant_shard_size: 5
compactor_tenant_shard_size: 5
max_label_names_per_series: 50
max_series_per_metric: 200000
max_global_series_per_metric: 200000
store_gateway_tenant_shard_size: 5
distributor:
shard_by_all_labels: true
pool:
health_check_ingesters: true
ha_tracker:
enable_ha_tracker: true
sharding_strategy: shuffle-sharding
remote_timeout: 5s
ring:
kvstore:
store: consul
prefix: distributor/
.....
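As far as I understand the Cortex config, the "Distributor is not running with global limits enabled" message on the ring page means the distributor ring is not active, and that ring is only started when ingestion rate limiting uses the global strategy. A sketch of that setting (an assumption to verify against your version, not a confirmed fix):

limits:
  # Switches ingestion rate limiting from per-instance ("local") to cluster-wide,
  # which requires the distributor ring and populates the /distributor/ring page.
  ingestion_rate_strategy: global
distributor:
  ring:
    kvstore:
      store: consul
      prefix: distributor/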
When I erased all the data in the fake directory, I found that only one node regenerated it:
rm -rf /data/cortex/tsdb/fake/*
restart cortex
ll /data/cortex/tsdb/fake/
Only one node generates the bucket-index.json.gz file; the other four nodes have not generated this file for a long time. What is the reason for this?
I can confirm that when I change the configuration as follows, the bucket-index.json.gz file on each node is updated normally. I will now observe whether the query results are correct. Must each node have its own separate prefix (directory), or should they share one? Why is that?
compactor:
data_dir: /data/tmp_cortex_data/compactor
sharding_enabled: true
sharding_strategy: shuffle-sharding
sharding_ring:
kvstore:
store: consul
prefix: compactor/1
....
prefix: compactor/2
prefix: compactor/3
prefix: compactor/4
prefix: compactor/5
.....
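For what it's worth, my understanding of the ring semantics is that all compactors meant to shard work together must join the same ring, which means sharing one KV prefix; with five different prefixes each compactor forms its own single-member ring and maintains the bucket index independently. A sketch of the shared-ring variant (an assumption to verify, not a confirmed fix):

compactor:
  data_dir: /data/tmp_cortex_data/compactor
  sharding_enabled: true
  sharding_strategy: shuffle-sharding
  sharding_ring:
    kvstore:
      store: consul
      # One prefix shared by every compactor instance, so they all join the same
      # ring and split the tenants among themselves.
      prefix: compactor/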
Are you using filesystem as the bucket? I don't think this is recommended because the bucket won't be shared by multiple instances; only the local instance can access it. I think this might also have caused the consul issue you mentioned.
Distributor should be up and running, otherwise the ingester cannot write metrics, IIUC.
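Since the filesystem backend is only visible to the local instance, the usual setup is a shared object store; a minimal sketch using the s3 backend, which also works with S3-compatible stores such as MinIO (endpoint, bucket and credentials are placeholders):

blocks_storage:
  backend: s3
  s3:
    endpoint: s3.example.com:9000
    bucket_name: cortex-blocks
    access_key_id: <access-key>
    secret_access_key: <secret-key>
    # Only set to true for plain-HTTP, S3-compatible endpoints such as a local MinIO.
    insecure: false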
Are you using filesystem as the bucket? I don't think this is recommended because the bucket won't be shared by multiple instances; only the local instance can access it. I think this might also have caused the consul issue you mentioned.
Yes, I use the filesystem as my bucket. I thought Cortex could be used as a distributed data store like HDFS. I set up an NFS shared directory for verification, but I still cannot query the data. Why is that?
Here is my configuration:
blocks_storage:
tsdb:
dir: /data/tmp_cortex_data/tsdb
retention_period: 13h
bucket_store:
sync_dir: /data/cortex/tsdb-sync
bucket_index:
enabled: true
ignore_blocks_within: 10h
backend: filesystem # s3, gcs, azure or filesystem are valid options
filesystem:
dir: /data/nfs_client/cortex/tsdb
------ I also tried the following, with the same result: the query cannot find the data
bucket_index:
enabled: false
sudo mount -t nfs fs-12-65-141.xx.xx.com:/data1/nfs/cortex/ /data/nfs_client/cortex
df -h
fs-12-65-141.xx.xx.com:/data1/nfs/cortex 7.3T 1.1G 7.3T 1% /data/nfs_client/cortex
ll /data/nfs_client/cortex/tsdb/fake/
total 40
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAMT9R4N057AWMF2X69GEHG
drwxr-xr-x 3 root root 4096 Sep 27 14:22 01HBAMTA8TCGM9XV63AMT6W1X0
drwxr-xr-x 3 root root 4096 Sep 27 14:22 01HBAMTSCQ7S59XZ6EJSKVEDM6
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAQ0B4EEZN0RKY2QS40KGX8
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAQ0P26G3F3PMHWCYVRRD75
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAQ0YP42PE6PNPAP9ZVKFDP
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAQ1A4MNS99W6A50BSXEW3D
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAQZEFR8204JQEAXQWV9YXM
-rw-r--r-- 1 root root 423 Sep 27 16:01 bucket-index.json.gz
drwxr-xr-x 2 root root 4096 Sep 27 15:17 markers
Data in the filesystem dir still cannot be queried.
Cortex log:
level=debug ts=2023-09-27T07:52:28.878652176Z caller=logging.go:76 traceID=53131909ab0a65c2 msg="POST /prometheus/api/v1/query_exemplars (200) 1.776868ms"
level=debug ts=2023-09-27T07:52:28.89341033Z caller=grpc_logging.go:46 method=/cortex.Ingester/QueryExemplars duration=21.217µs msg="gRPC (success)"
level=debug ts=2023-09-27T07:52:28.894055947Z caller=logging.go:76 traceID=77500f26ef81d6a9 msg="GET /prometheus/api/v1/query_exemplars?query=doris_fe_connection_total&start=1695750720&end=1695801120 (200) 1.340227ms"
ts=2023-09-27T07:52:28.938732098Z caller=spanlogger.go:87 org_id=fake method=querier.Select level=debug start="2023-09-26 17:47:00 +0000 UTC" end="2023-09-27 07:52:00 +0000 UTC" step=30000 matchers="unsupported value type"
ts=2023-09-27T07:52:28.938774809Z caller=spanlogger.go:87 org_id=fake method=blocksStoreQuerier.selectSorted level=debug msg="the max time of the query to blocks storage has been manipulated" original=1695801120000 updated=1695757948938
ts=2023-09-27T07:52:28.938786055Z caller=spanlogger.go:87 org_id=fake method=distributorQuerier.Select level=debug msg="the min time of the query to ingesters has been manipulated" original=1695750420000 updated=1695754348938
ts=2023-09-27T07:52:28.93880729Z caller=spanlogger.go:87 org_id=fake method=blocksStoreQuerier.selectSorted level=debug msg="no blocks found"
level=debug ts=2023-09-27T07:52:28.957023272Z caller=logging.go:76 traceID=1581fc90bc7cc1d3 msg="GET /prometheus/api/v1/query_range?query=doris_fe_connection_total&start=1695750720&end=1695801120&step=30 (200) 18.656604ms"
level=debug ts=2023-09-27T07:52:28.995695776Z caller=grpc_logging.go:46 method=/cortex.Ingester/QueryExemplars duration=32.262µs msg="gRPC (success)"
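The two "has been manipulated" lines above correspond, as far as I can tell, to the querier's time-splitting settings: blocks storage is only asked for samples older than query_store_after, and ingesters only for samples within query_ingesters_within. Judging from the timestamps in this log, the effective values look like roughly 12h and 13h. A sketch of the relevant section (values shown for illustration, not as a recommendation):

querier:
  # Blocks storage (store-gateways) is only queried for samples older than this.
  query_store_after: 12h
  # Ingesters are only queried for samples newer than now minus this window.
  query_ingesters_within: 13h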
Distributor should be up and running, otherwise the ingester cannot write metrics, IIUC.
How do I start the Distributor service? I seem to have configured everything, but it still hasn't started. Why is that? How do I configure it?
Isn't that weird? I restarted with the compactor disabled and bucket_index disabled, and the next day the data was normal again.
Then I enabled the compactor and bucket_index again, and all the services are still working. There may have been some conflict during the transition, but the exact reason is not yet known; I will keep observing.
The current query data comes from the local TSDB (13h) plus the filesystem TSDB.
By the way, because every Cortex instance points the filesystem dir at /nfs/shared_directory, which lives on a single node, it needs disaster recovery: the data has to be backed up and kept in sync. Cortex would need the filesystem dir to support multiple nodes the way HDFS does.
Added to crontab:
rsync -rav --append --delete nfs@fs-12-65-141.xx.xx.com:/data1/nfs/cortex/ /data1/nfs/cortex/
Describe the bug: When Cortex is queried via the Prometheus API, only the data in the blocks_storage tsdb dir can be queried; the data in the bucket_store backend filesystem dir cannot be queried.
To Reproduce
Expected behavior: Cortex should be able to query the complete data.
Environment: 5 physical machines running Ubuntu 16.04, with Cortex deployed as 5 ingesters.
Config: one of the nodes' consul-config-blocks-local.yaml.
supervisor start cortex
Additional context: the Cortex debug log shows the warning "bucket index not found". What causes "bucket index not found", and how do I configure Cortex so the bucket index can be found?
The effect when Grafana queries Cortex is shown here: https://github.com/cortexproject/cortex/issues/5529#issuecomment-1707849994