cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/

bucket index not found #5574

Closed: liangrui1988 closed this issue 1 year ago.

liangrui1988 commented 1 year ago

Describe the bug: When querying Cortex (Prometheus API), only data in the blocks_storage tsdb dir can be queried; data already shipped to the bucket_store backend (filesystem dir) cannot be queried.

To Reproduce

cortex -version
Cortex, version 1.15.3 (branch: release-1.15, revision: 21e836605)
  build user:       
  build date:       
  go version:       go1.19.3
  platform:         linux/amd64
  tags:             netgo
  1. Perform operations (Read/Write/Others): data is remote-written from Prometheus into Cortex storage in the normal way, but the query runs into this problem.

Expected behavior: Cortex should be able to query the complete data.

Environment: 5 physical machines running Ubuntu 16.04, with Cortex deployed as 5 ingesters.

One node's consul-config-blocks-local.yaml:

# Configuration for running Cortex in single-process mode.
# This should not be used in production.  It is only for getting started
# and development.

# Disable the requirement that every request to Cortex has a
# X-Scope-OrgID header. `fake` will be substituted in instead.
auth_enabled: true

server:
  http_listen_port: 9009

  # Configure the server to allow messages up to 100MB.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

limits:
  accept_ha_samples: true
#  compactor_blocks_retention_period: 2592005s
  ingestion_tenant_shard_size: 5
  compactor_tenant_shard_size: 5    
  max_label_names_per_series: 50
  max_series_per_metric: 200000
  max_global_series_per_metric: 200000
  store_gateway_tenant_shard_size: 5     

distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true
  ha_tracker:
    enable_ha_tracker: true
  sharding_strategy: shuffle-sharding      

ingester_client:
  grpc_client_config:
    # Configure the client to allow messages up to 100MB.
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip

ingester:
  lifecycler:
    # The address to advertise for this ingester.  Will be autodiscovered by
    # looking up address on eth0 or en0; can be specified if this fails.
    # address: 127.0.0.1

    # We want to start immediately and flush on shutdown.
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512

    # Use an in memory ring store, so we don't need to launch a Consul.
    ring:
      kvstore:
        store: Consul
      replication_factor: 3
  ignore_series_limit_for_metric_names: limits

blocks_storage:
  tsdb:
    dir: /data/tmp_cortex_data/tsdb
    retention_period: 24h     

  bucket_store:
    sync_dir: /data/tmp_cortex_data/tsdb-sync
    bucket_index:
      enabled: true

  # You can choose between local storage and Amazon S3, Google GCS and Azure storage. Each option requires additional configuration
  # as shown below. All options can be configured via flags as well which might be handy for secret inputs.
  backend: filesystem # s3, gcs, azure or filesystem are valid options
# s3:
#   bucket_name: cortex
#   endpoint: s3.dualstack.us-east-1.amazonaws.com
    # Configure your S3 credentials below.
    # secret_access_key: "TODO"
    # access_key_id:     "TODO"
#  gcs:
#    bucket_name: cortex
#    service_account: # if empty or omitted Cortex will use your default service account as per Google's fallback logic
#  azure:
#    account_name:
#    account_key:
#    container_name:
#    endpoint_suffix:
#    max_retries: # Number of retries for recoverable errors (defaults to 20)
  filesystem:
    dir: /data/cortex/data/tsdb

compactor:
  data_dir: /data/tmp_cortex_data/compactor
  sharding_enabled: true
  sharding_strategy: shuffle-sharding
  sharding_ring:
    kvstore:
      store: Consul

frontend_worker:
  match_max_concurrent: true

ruler:
  enable_api: true

ruler_storage:
  backend: local
  local:
    directory: /data/tmp_cortex_data/rules

store_gateway:
  sharding_enabled: true 

Cortex is started with supervisor:

[program:cortex]
command=/data/cortex/cortex -config.file=/data/cortex/conf/consul-config-blocks-local.yaml  -distributor.ring.instance-interface-names=bond0 
    -ingester.lifecycler.interface=bond0 
    -frontend.instance-interface-names=bond0 
    -ruler.ring.instance-interface-names=bond0 
    -alertmanager.sharding-ring.instance-interface-names=bond0 
    -compactor.ring.instance-interface-names=bond0 
    -store-gateway.sharding-ring.instance-interface-names=bond0
    -distributor.shard-by-all-labels=true
    -ring.store=consul 
    -consul.hostname=10.12.29.3:8500 
    -distributor.replication-factor=3 
    #-runtime-config.file=/data/cortex/conf/runtime-config.yaml
autostart=true
autorestart=true
startretries=5
stderr_logfile=/data/logs/cortex/stderr.log
stderr_logfile_maxbytes=10MB
stdout_logfile=/data/logs/cortex/stdout.log
stdout_logfile_maxbytes=10MB

Additional Context: the Cortex debug log warns "bucket index not found". What causes "bucket index not found", and how do I configure Cortex so the bucket index can be found?

level=warn ts=2023-09-20T13:55:30.300188789Z caller=grpc_logging.go:43 method=/cortex.Ingester/QueryExemplars duration=37.9µs err="rpc error: code = Unavailable desc = Starting" msg=gRPC
level=debug ts=2023-09-20T13:55:30.302172524Z caller=logging.go:76 traceID=5a8cfd3c622c52cb msg="POST /prometheus/api/v1/query_exemplars (200) 3.54803ms"
ts=2023-09-20T13:55:30.330378613Z caller=spanlogger.go:87 org_id=fake method=querier.Select level=debug start="2023-09-19 07:50:00 +0000 UTC" end="2023-09-20 13:55:00 +0000 UTC" step=60000 matchers="unsupported value type"
level=warn ts=2023-09-20T13:55:30.330495092Z caller=loader.go:117 msg="bucket index not found" user=fake
ts=2023-09-20T13:55:30.33051205Z caller=spanlogger.go:87 org_id=fake method=blocksStoreQuerier.selectSorted level=debug msg="no blocks found"
level=warn ts=2023-09-20T13:55:30.330750852Z caller=grpc_logging.go:64 duration=58.006µs method=/cortex.Ingester/QueryStream err="rpc error: code = Unavailable desc = Starting" msg=gRPC
level=debug ts=2023-09-20T13:55:30.358337311Z caller=logging.go:76 traceID=28cf5ca693933eb0 msg="POST /prometheus/api/v1/query_range (200) 28.531762ms"
level=debug ts=2023-09-20T13:55:32.73972313Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=80.454µs msg="gRPC (success)"

The effect on Grafana queries against Cortex is as follows: https://github.com/cortexproject/cortex/issues/5529#issuecomment-1707849994

yeya24 commented 1 year ago

https://cortexmetrics.io/docs/blocks-storage/bucket-index/

Please take a look at the doc. The bucket index is created and updated periodically by the Compactor component, so you need to have it up and running. If you don't run a compactor, you can disable the bucket index in your read path.
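
For reference, a minimal sketch of what disabling the bucket index on the read path could look like, reusing the blocks_storage section from the config above (only the bucket_index flag changes):

blocks_storage:
  bucket_store:
    bucket_index:
      enabled: false   # queriers / store-gateways then scan the bucket instead of reading bucket-index.json.gz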

liangrui1988 commented 1 year ago

https://cortexmetrics.io/docs/blocks-storage/bucket-index/

Please take a look at the doc. The bucket index is created and updated periodically by the Compactor component, so you need to have it up and running. If you don't run a compactor, you can disable the bucket index in your read path.

Shouldn't the compactor be enabled by default?

I have read the relevant docs, but I can't find a section that covers this. How do I confirm that the compactor is running? How do I configure and verify it? Can you help guide me?

My compactor config:

compactor:
  data_dir: /data/tmp_cortex_data/compactor
  sharding_enabled: true
  sharding_strategy: shuffle-sharding
  sharding_ring:
    kvstore:
      store: consul

Log showing that instance compaction completed successfully:

level=debug ts=2023-09-22T08:21:39.422467759Z caller=ruler.go:485 msg="syncing rules" reason=periodic
level=debug ts=2023-09-22T08:21:40.230728704Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=44.683µs msg="gRPC (success)"
level=debug ts=2023-09-22T08:21:52.917003211Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=41.855µs msg="gRPC (success)"
level=debug ts=2023-09-22T08:21:55.231174555Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=43.984µs msg="gRPC (success)"
level=info ts=2023-09-22T08:22:05.257402463Z caller=shipper.go:334 org_id=hdfs_fsimage msg="upload new block" id=01HAXZMTY275XNBEHSKWQVGKZM
level=debug ts=2023-09-22T08:22:05.258714514Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=celeborn_metrics uploaded=0
level=debug ts=2023-09-22T08:22:05.258718154Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=yarn_app_finish uploaded=0
level=debug ts=2023-09-22T08:22:05.25911267Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=kubesphere_metrics uploaded=0
level=debug ts=2023-09-22T08:22:05.259115908Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=fake uploaded=0
level=debug ts=2023-09-22T08:22:05.259111442Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=hdfs_testclsuter uploaded=0
level=debug ts=2023-09-22T08:22:05.259117919Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=yarn_app_run uploaded=0
level=debug ts=2023-09-22T08:22:05.260376917Z caller=objstore.go:288 org_id=hdfs_fsimage msg="uploaded file" from=/data/tmp_cortex_data/tsdb/hdfs_fsimage/thanos/upload/01HAXZMTY275XNBEHSKWQVGKZM/chunks/000001 dst=01HAXZMTY275XNBEHSKWQVGKZM/chunks/000001 bucket="tracing: fs: /data/cortex/data/tsdb"
level=debug ts=2023-09-22T08:22:05.267450004Z caller=objstore.go:288 org_id=hdfs_fsimage msg="uploaded file" from=/data/tmp_cortex_data/tsdb/hdfs_fsimage/thanos/upload/01HAXZMTY275XNBEHSKWQVGKZM/index dst=01HAXZMTY275XNBEHSKWQVGKZM/index bucket="tracing: fs: /data/cortex/data/tsdb"
level=debug ts=2023-09-22T08:22:05.268059327Z caller=ingester.go:2279 msg="shipper successfully synchronized TSDB blocks with storage" user=hdfs_fsimage uploaded=1
level=debug ts=2023-09-22T08:22:05.602752313Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=hdfs_testclsuter compactReason=regular
level=debug ts=2023-09-22T08:22:05.602768281Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=fake compactReason=regular
level=debug ts=2023-09-22T08:22:05.602846142Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=yarn_app_run compactReason=regular
level=debug ts=2023-09-22T08:22:05.602850518Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=celeborn_metrics compactReason=regular
level=debug ts=2023-09-22T08:22:05.603029012Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=yarn_app_finish compactReason=regular
level=debug ts=2023-09-22T08:22:05.603215064Z caller=ingester.go:2368 msg="TSDB blocks compaction completed successfully" user=kubesphere_metrics compactReason=regular
level=debug ts=2023-09-22T08:22:07.918550378Z caller=grpc_logging.go:46 method=/grpc.health.v1.Health/Check duration=32.503µs msg="gRPC (success)"

The traced file data also looks normal.

yeya24 commented 1 year ago

If you are using Cortex all-in-one (target=all), then you should have your compactor running in the same process.
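
For illustration, a hedged sketch of the two ways to get a compactor running in this setup: either the single-binary target=all mentioned above, or an explicit module list that includes compactor (as the reporter does later in this thread):

# Option A: run every component, including the compactor, in one process
target: all

# Option B: run an explicit set of modules that includes the compactor
target: compactor,distributor,ingester,purger,querier,query-frontend,ruler,store-gateway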

liangrui1988 commented 1 year ago

If you are using Cortex all-in-one (target=all), then you should have your compactor running in the same process.

After adding compactor to the target in the config, the bucket index does take effect.

But why do queries against some nodes keep failing? What do we need to do about this?

level=debug ts=2023-09-25T09:18:46.702103245Z caller=grpc_logging.go:46 method=/cortex.Ingester/QueryExemplars duration=22.19µs msg="gRPC (success)"
level=error ts=2023-09-25T09:18:46.702509846Z caller=retry.go:79 org_id=fake msg="error processing request" try=3 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: bucket index is too old and the last time it was updated exceeds the allowed max staleness\"}"
ts=2023-09-25T09:18:46.70258749Z caller=spanlogger.go:87 org_id=fake method=QueryStream level=debug series=4 samples=160
level=debug ts=2023-09-25T09:18:46.702614752Z caller=grpc_logging.go:67 method=/cortex.Ingester/QueryStream duration=143.689µs msg="gRPC (success)"
ts=2023-09-25T09:18:46.702824041Z caller=spanlogger.go:87 org_id=fake method=querier.Select level=debug start="2023-09-25 09:08:45 +0000 UTC" end="2023-09-25 09:18:45 +0000 UTC" step=15000 matchers="unsupported value type"
ts=2023-09-25T09:18:46.703164549Z caller=spanlogger.go:87 org_id=fake method=QueryStream level=debug series=4 samples=160
level=debug ts=2023-09-25T09:18:46.703194485Z caller=grpc_logging.go:67 duration=131.71µs method=/cortex.Ingester/QueryStream msg="gRPC (success)"
ts=2023-09-25T09:18:46.70381203Z caller=spanlogger.go:87 org_id=fake method=QueryStream level=debug series=4 samples=160
level=debug ts=2023-09-25T09:18:46.70384042Z caller=grpc_logging.go:67 method=/cortex.Ingester/QueryStream duration=128.877µs msg="gRPC (success)"
level=error ts=2023-09-25T09:18:46.70383671Z caller=retry.go:79 org_id=fake msg="error processing request" try=4 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: bucket index is too old and the last time it was updated exceeds the allowed max staleness\"}"
level=warn ts=2023-09-25T09:18:46.70389064Z caller=logging.go:86 traceID=7c9a1faf23cffa73 msg="GET /prometheus/api/v1/query_range?query=doris_fe_connection_total&start=1695633225&end=1695633525&step=15 (500) 7.333013ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"errorType\\\":\\\"internal\\\",\\\"error\\\":\\\"expanding series: bucket index is too old and the last time it was updated exceeds the allowed max staleness\\\"}\" ws: false; Accept: application/json, text/plain, */*; Accept-Encoding: gzip, deflate, br; Accept-Language: zh-CN,zh;q=0.9; Sec-Ch-Ua: \".Not/A)Brand\";v=\"99\", \"Google Chrome\";v=\"103\", \"Chromium\";v=\"103\"; Sec-Ch-Ua-Mobile: ?0; Sec-Ch-Ua-Platform: \"Windows\"; Sec-Fetch-Dest: empty; Sec-Fetch-Mode: cors; Sec-Fetch-Site: same-origin; User-Agent: Grafana/7.5.17; X-Dashboard-Id: 2; X-Forwarded-For: 183.6.36.48, 10.12.75.19, 10.12.75.19; X-Grafana-Org-Id: 1; X-Panel-Id: 11; X-Real-Ip: 183.6.36.48; X-Request-Id: 702f120e06a9609fdb354dd5a58b9f6c; X-Scheme: https; X-Scope-Orgid: fake; "

Only one node's bucket-index.json.gz has a recent update time; on the other nodes it was generated long ago (see the compactor ring status). Why is the bucket-index.json.gz file not being updated?

liangrui1988 commented 1 year ago

After changing store: consul to inmemory, the index went back to updating normally. Is there anything to watch out for when the compactor uses Consul? Why was the bucket-index.json.gz file not being updated properly? Do I need to configure an independent directory for each node? The data written to Consul is garbled; is this normal? The documentation is a poor guide to these configurations, which often trips people up.

Compactor config after the update:

compactor:
  data_dir: /data/tmp_cortex_data/compactor
  sharding_enabled: true
  sharding_strategy: shuffle-sharding
  sharding_ring:
    kvstore:
      #store: consul
      #prefix: compactor/
      store: inmemory
    instance_interface_names: [bond0]

consul kv get compactor/compactor

�f�3
�
#cortex-65-148.hiido.host.int.yy.com�
10.12.65.148:9095�ɨ2������������ŏ����������ׅ�$���&���&���.ʨ�1���4���:�ٲ?˿�B��E���G���Iٗ�M���V���X���X���b���q��y鳑z���z��у�������Đ܈��߈̀��ǎ�����������脃���ԙ��՜�����������������Դ���������ѡ������ƾ����������
......
yeya24 commented 1 year ago

@liangrui1988

Is there anything to watch out for when the compactor uses Consul? Why was the bucket-index.json.gz file not being updated properly?

It should be fine. Can you check whether a bucket-index.json.gz file is generated in the bucket? From what I can see so far, the bucket index file is generated; it just is not updated in time.

Can you check the value of the compactor.cleanup-interval flag? If you are using the default value, it is 15 minutes. Ideally the bucket index should be updated at this interval. If it is not up to date, check whether your compactor is falling behind.

You can use the metric cortex_bucket_index_last_successful_update_timestamp_seconds to check this.
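
As a quick check (a sketch, assuming the metrics endpoint is exposed on the http_listen_port 9009 from the config above), the metric can be read on each node and compared against the current time:

# Shows the last successful bucket-index update timestamp, per tenant
curl -s http://localhost:9009/metrics | grep cortex_bucket_index_last_successful_update_timestamp_seconds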

liangrui1988 commented 1 year ago

@liangrui1988

Is there anything to watch out for when the compactor uses Consul? Why was the bucket-index.json.gz file not being updated properly?

It should be fine. Can you check whether a bucket-index.json.gz file is generated in the bucket? From what I can see so far, the bucket index file is generated; it just is not updated in time.

Can you check the value of the compactor.cleanup-interval flag? If you are using the default value, it is 15 minutes. Ideally the bucket index should be updated at this interval. If it is not up to date, check whether your compactor is falling behind.

You can use the metric cortex_bucket_index_last_successful_update_timestamp_seconds to check this.

compactor.cleanup-interval uses the default value of 15 minutes.

I checked cortex_bucket_index_last_successful_update_timestamp_seconds and found that it is mostly not generated; occasionally a few updates appear, but they do not look like a complete index update.


These metrics do not all show up at once as you would expect from a 15-minute update interval.

I wonder if the distributor is not running. My status page says "Distributor is not running with global limits enabled". How do I enable this?

My config:

target: compactor,distributor,ingester,purger,querier,query-frontend,ruler,store-gateway

limits:
  accept_ha_samples: true
  compactor_blocks_retention_period: 2592005s
  ingestion_tenant_shard_size: 5
  compactor_tenant_shard_size: 5
  max_label_names_per_series: 50
  max_series_per_metric: 200000
  max_global_series_per_metric: 200000    
  store_gateway_tenant_shard_size: 5  

distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true
  ha_tracker:
    enable_ha_tracker: true
  sharding_strategy: shuffle-sharding 
  remote_timeout: 5s
  ring:
    kvstore:
      store: consul
      prefix: distributor/

.....

When I erased all the data in the fake directory, I found that only one node regenerated it:

rm -rf /data/cortex/tsdb/fake/*
restart cortex
ll /data/cortex/tsdb/fake/

Only one node generates the bucket-index.json.gz file; the other 4 nodes have not generated this file for a long time. What is the reason for this?

liangrui1988 commented 1 year ago

I can confirm that when I change the configuration as follows, the bucket-index.json.gz file on each node is updated normally. Next I will observe whether the query results are correct. Must each node have a separate directory (prefix), or should they share one? Why is that?

compactor:
  data_dir: /data/tmp_cortex_data/compactor
  sharding_enabled: true
  sharding_strategy: shuffle-sharding
  sharding_ring:
    kvstore:      
      store: consul
      prefix: compactor/1

....
      prefix: compactor/2
      prefix: compactor/3
      prefix: compactor/4
      prefix: compactor/5     
.....
yeya24 commented 1 year ago

Are you using filesystem as the bucket? I don't think this is recommended, because the bucket won't be shared by multiple instances; only the local instance can access it. I think this might also have caused the Consul issue you mentioned.

The distributor should be up and running, otherwise the ingester cannot write metrics, IIUC.
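
Purely as an illustration of the shared-bucket suggestion above (bucket name, endpoint, and credentials are placeholders copied from the commented-out section of the config earlier in this issue), a shared object store backend would look roughly like this instead of a per-node filesystem:

blocks_storage:
  backend: s3
  s3:
    bucket_name: cortex
    endpoint: s3.dualstack.us-east-1.amazonaws.com
    access_key_id: "TODO"
    secret_access_key: "TODO"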

liangrui1988 commented 1 year ago

Are you using filesystem as the bucket? I don't think this is recommended, because the bucket won't be shared by multiple instances; only the local instance can access it. I think this might also have caused the Consul issue you mentioned.

Yes, I use the filesystem as my bucket. I thought Cortex could be used as a distributed data store like HDFS. I set up an NFS shared directory for verification, but the data still cannot be queried. Why is that?

Here is my configuration:

blocks_storage:
  tsdb:
    dir: /data/tmp_cortex_data/tsdb
    retention_period: 13h
  bucket_store:
    sync_dir: /data/cortex/tsdb-sync
    bucket_index:
      enabled: true
    ignore_blocks_within: 10h
  backend: filesystem # s3, gcs, azure or filesystem are valid options
  filesystem:
    dir: /data/nfs_client/cortex/tsdb

I also tried this, with the same result: the query cannot find the data.
    bucket_index:
      enabled: false
sudo mount -t nfs fs-12-65-141.xx.xx.com:/data1/nfs/cortex/ /data/nfs_client/cortex
df -h
fs-12-65-141.xx.xx.com:/data1/nfs/cortex  7.3T  1.1G  7.3T   1% /data/nfs_client/cortex

ll /data/nfs_client/cortex/tsdb/fake/
total 40
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAMT9R4N057AWMF2X69GEHG
drwxr-xr-x 3 root root 4096 Sep 27 14:22 01HBAMTA8TCGM9XV63AMT6W1X0
drwxr-xr-x 3 root root 4096 Sep 27 14:22 01HBAMTSCQ7S59XZ6EJSKVEDM6
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAQ0B4EEZN0RKY2QS40KGX8
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAQ0P26G3F3PMHWCYVRRD75
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAQ0YP42PE6PNPAP9ZVKFDP
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAQ1A4MNS99W6A50BSXEW3D
drwxr-xr-x 3 root root 4096 Sep 27 15:17 01HBAQZEFR8204JQEAXQWV9YXM
-rw-r--r-- 1 root root  423 Sep 27 16:01 bucket-index.json.gz
drwxr-xr-x 2 root root 4096 Sep 27 15:17 markers

Data in the filesystem dir still cannot be queried.

Cortex log:

level=debug ts=2023-09-27T07:52:28.878652176Z caller=logging.go:76 traceID=53131909ab0a65c2 msg="POST /prometheus/api/v1/query_exemplars (200) 1.776868ms"
level=debug ts=2023-09-27T07:52:28.89341033Z caller=grpc_logging.go:46 method=/cortex.Ingester/QueryExemplars duration=21.217µs msg="gRPC (success)"
level=debug ts=2023-09-27T07:52:28.894055947Z caller=logging.go:76 traceID=77500f26ef81d6a9 msg="GET /prometheus/api/v1/query_exemplars?query=doris_fe_connection_total&start=1695750720&end=1695801120 (200) 1.340227ms"
ts=2023-09-27T07:52:28.938732098Z caller=spanlogger.go:87 org_id=fake method=querier.Select level=debug start="2023-09-26 17:47:00 +0000 UTC" end="2023-09-27 07:52:00 +0000 UTC" step=30000 matchers="unsupported value type"
ts=2023-09-27T07:52:28.938774809Z caller=spanlogger.go:87 org_id=fake method=blocksStoreQuerier.selectSorted level=debug msg="the max time of the query to blocks storage has been manipulated" original=1695801120000 updated=1695757948938
ts=2023-09-27T07:52:28.938786055Z caller=spanlogger.go:87 org_id=fake method=distributorQuerier.Select level=debug msg="the min time of the query to ingesters has been manipulated" original=1695750420000 updated=1695754348938
ts=2023-09-27T07:52:28.93880729Z caller=spanlogger.go:87 org_id=fake method=blocksStoreQuerier.selectSorted level=debug msg="no blocks found"
level=debug ts=2023-09-27T07:52:28.957023272Z caller=logging.go:76 traceID=1581fc90bc7cc1d3 msg="GET /prometheus/api/v1/query_range?query=doris_fe_connection_total&start=1695750720&end=1695801120&step=30 (200) 18.656604ms"
level=debug ts=2023-09-27T07:52:28.995695776Z caller=grpc_logging.go:46 method=/cortex.Ingester/QueryExemplars duration=32.262µs msg="gRPC (success)"

The distributor should be up and running, otherwise the ingester cannot write metrics, IIUC.

How do we start the distributor service? I seem to have everything configured, but it still hasn't started. Why is that? How do I configure it?

liangrui1988 commented 1 year ago

Isn't that weird? I restarted with the compactor disabled and bucket_index disabled, and the next day the data was normal again.

Then I enabled the compactor and bucket_index again, and all the services still work. There was probably some transitional conflict during that period; the exact cause is not yet known. I will keep observing.

The current query data is local tsdb (13h) + filesystem tsdb.

By the way, since all Cortex instances use filesystem dir = the NFS shared directory, this is a single-node directory and needs DR, so the data has to be backed up and kept in sync. Cortex would need the filesystem dir to support multiple nodes, like HDFS.

A crontab entry was added for this:

rsync -rav --append --delete nfs@fs-12-65-141.xx.xx.com:/data1/nfs/cortex/ /data1/nfs/cortex/