cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Compaction is Not Running Properly Because of Out-of-order Chunk #5584

Open LERUfic opened 10 months ago

LERUfic commented 10 months ago

Describe the bug Due to out-of-order chunks, the compactor is not performing compaction as expected. Although I have set skip_blocks_with_out_of_order_chunks_enabled: true in the configuration, the block is not being marked for no-compaction.

To Reproduce Steps to reproduce the behavior:

  1. Start Cortex 1.14.1
    / # cortex --version
    Cortex, version 1.14.1 (branch: HEAD, revision: 984ac41)
    build user:
    build date:
    go version:       go1.19
    platform:         linux/amd64
  2. Run the compaction process

Expected behavior I expect the compaction process to run smoothly. Even when out-of-order chunks occur, the affected blocks should be skipped because of the skip_blocks_with_out_of_order_chunks_enabled: true config.

Environment:

Additional Context My compactor config

auth_enabled: true
tenant_federation:
  enabled: true
limits:
  enforce_metric_name: true
  reject_old_samples: true
  reject_old_samples_max_age: 365d
  max_label_name_length: 2048
  max_label_value_length: 4096
  max_label_names_per_series: 1024
  max_metadata_length: 2048
  max_query_lookback: 0
  compactor_blocks_retention_period: 365d
  max_series_per_user: 0
  max_series_per_metric: 0
  max_fetched_chunks_per_query: 0
  max_series_per_query: 10000000
  max_metadata_per_user: 0
  max_metadata_per_metric: 0
server:
  http_listen_port: 8080
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 100000000
  grpc_server_max_send_msg_size: 100000000
  grpc_server_max_concurrent_streams: 10000
  log_level: info
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 100000000
    max_send_msg_size: 100000000
storage:
  engine: blocks
blocks_storage:
  backend: gcs
  gcs:
    bucket_name: <redacted>
  tsdb:
    dir: /data/tsdb
    block_ranges_period:
      - 1h0m0s
    retention_period: 10h
  bucket_store:
    ignore_deletion_mark_delay: 1h
    sync_dir: /data/tsdb-sync
    max_concurrent: 1000
    bucket_index:
      enabled: true
      max_stale_period: 24h
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      store: "memberlist"
    replication_factor: 2
memberlist:
  bind_port: 7946
  join_members:
    - '{{ include "cortex.fullname" $ }}-memberlist'
compactor:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      store: "memberlist"
  skip_blocks_with_out_of_order_chunks_enabled: true
...

Runtime Config

runtime_config:
  overrides:
    prometheus-data-prd:
      max_query_lookback: 0
      compactor_blocks_retention_period: 365d
      max_series_per_metric: 0
      max_series_per_query: 20000000
      ingestion_rate: 20000000

Metrics

cortex_compactor_runs_failed_total 1
cortex_compactor_runs_started_total 1
cortex_compactor_runs_completed_total 0
cortex_compactor_runs_interrupted_total 0
cortex_bucket_blocks_marked_for_no_compaction_count{user="prometheus-data-prd"} 0

Error logs

{"blocks":"[data/compact/0@8032743924406704676/01GVSPKX0477E5R42A89AJ4C2K data/compact/0@8032743924406704676/01GVSPKY9YGRT3NC0GT0E6ANTZ data/compact/0@8032743924406704676/01GVSPJTT7AFDGS33Z4BPANNXA data/compact/0@8032743924406704676/01GVSPK01NDHJWSX9ZD46R24PK data/compact/0@8032743924406704676/01GVSPKPN74CV8RWQ6RNNHDHBJ data/compact/0@8032743924406704676/01GVSPJWFZ3SCRG2R0ZNG8EJSQ data/compact/0@8032743924406704676/01GVSPJH83QGFKT0ZVW23KBCM1 data/compact/0@8032743924406704676/01GVSPK8JGH1XYXV0SM07C7QKZ data/compact/0@8032743924406704676/01GVSPJWYJKZWNS6H2NB06MSG8 data/compact/0@8032743924406704676/01GVSPKJ3K3BYJCTWTX61NFV6X]","caller":"compact.go:1097","component":"compactor","duration":"31.589419502s","duration_ms":31589,"group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","level":"info","msg":"compacted blocks","new":"01HBSWWRTXT3BC29KZERZJ0ZYZ","org_id":"prometheus-data-prd","overlapping_blocks":true,"ts":"2023-10-03T04:32:15.520558663Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.77270071Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.83720268Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.837553736Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.837782653Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.838058837Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.8382072Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.838398942Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.83861866Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.838865968Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.83909492Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.839210632Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.839330004Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.839442933Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.839611222Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.839776342Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.862857914Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.863121857Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.863390645Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.863593061Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.863937026Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.864282218Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.864422168Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.864677397Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.864931805Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.865108675Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.865267327Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.865520627Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.865847171Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.866020439Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.866556481Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.866783036Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.866886127Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.867022838Z"}
{"caller":"index.go:331","component":"compactor","group":"0@{__org_id__=\"prometheus-data-prd\"}","groupKey":"0@8032743924406704676","labels":{"<redacted>"},"level":"debug","msg":"found out of order series","org_id":"prometheus-data-prd","ts":"2023-10-03T04:32:17.867160204Z"}
{"caller":"compactor.go:696","component":"compactor","err":"compaction: group 0@8032743924406704676: invalid result block data/compact/0@8032743924406704676/01HBSWWRTXT3BC29KZERZJ0ZYZ: 34/359219 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)","level":"error","msg":"failed to compact user blocks","ts":"2023-10-03T04:32:18.308412624Z","user":"prometheus-data-prd"}

yeya24 commented 10 months ago

@LERUfic May I know which version of Prometheus you are using? I am wondering if it is the same issue as https://github.com/thanos-io/thanos/issues/6723

LERUfic commented 10 months ago

Sure @yeya24, I use Prometheus 2.42.0.

root@vm# ./prometheus --version
prometheus, version 2.42.0 (branch: HEAD, revision: 225c61122d88b01d1f0eaaee0e05b6f3e0567ac0)
  build user:       root@c67d48967507
  build date:       20230201-07:53:32
  go version:       go1.19.5
  platform:         linux/amd64

PS: oops, wrong account.

yeya24 commented 10 months ago

This seems like a bug in Thanos: https://github.com/thanos-io/thanos/blob/main/pkg/compact/compact.go#L1394. The error-cause check does not correctly identify that the original error is an out-of-order chunk error, so the compactor fails to skip the blocks with OOO chunks.

I will create an issue on the Thanos side.

LERUfic commented 10 months ago

I see, thank you for the response. For now I have marked the blocks with no-compact-mark.json using thanos tools.

type: GCS
config:
    bucket: <redacted>
prefix: prometheus-data-prd
thanos tools bucket mark --id=<block> --marker=no-compact-mark.json --objstore.config-file=thanos.yaml --details=OOO --log.level=debug

There seem to be no errors for now, but the compaction has not finished yet. We will monitor this for a while.