cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Metrics not being persisted in single binary mode #6119

Open balajisa09 opened 3 months ago

balajisa09 commented 3 months ago

Describe the bug I am running Cortex in single binary mode in Kubernetes with a PVC, and I have noticed that metrics are not being persisted for more than 5 hours. I have attached the config. I have a Prometheus instance sending metrics to Cortex via remote write. There is enough space on the disk too.

To Reproduce Steps to reproduce the behavior:

  1. Start Cortex v1.17.1 and Prometheus v2.52.0.
  2. Visualize the metrics via Grafana or any other tool.

Expected behavior Metrics should stay for the retention period given in the Cortex config.

Environment:

Additional Context

cortex config:

config.yaml: |
  auth_enabled: false

  server:
    http_listen_port: 9009

    # Configure the server to allow messages up to 100MB.
    grpc_server_max_recv_msg_size: 104857600
    grpc_server_max_send_msg_size: 104857600
    grpc_server_max_concurrent_streams: 1000

    http_tls_config:
      client_auth_type: RequireAndVerifyClientCert

    grpc_tls_config:
      client_auth_type: RequireAndVerifyClientCert
    log_level: debug

  distributor:
    shard_by_all_labels: true
    pool:
      health_check_ingesters: true

  ingester_client:
    grpc_client_config:
      # Configure the client to allow messages up to 100MB.
      max_recv_msg_size: 104857600
      max_send_msg_size: 104857600
      grpc_compression: gzip

  ingester:
    lifecycler:
      # The address to advertise for this ingester.  Will be autodiscovered by
      # looking up address on eth0 or en0; can be specified if this fails.
      # address: 127.0.0.1

      # We want to start immediately and flush on shutdown.
      min_ready_duration: 0s
      final_sleep: 0s
      num_tokens: 512

      # Use an in memory ring store, so we don't need to launch a Consul.
      ring:
        kvstore:
          store: inmemory
        replication_factor: 1

  blocks_storage:
    tsdb:
      dir: /data
      retention_period: 168h

    bucket_store:
      sync_dir: /data

    backend: filesystem
    filesystem:
      dir: /data/fake

  compactor:
    data_dir: /tmp/cortex/compactor
    sharding_ring:
      kvstore:
        store: inmemory

  frontend_worker:
    match_max_concurrent: true
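
Note: with auth_enabled: false Cortex uses the synthetic tenant fake, so the ingester's per-tenant TSDB directory works out to /data/fake, the same path as the filesystem bucket above, and bucket_store.sync_dir shares /data with tsdb.dir. Not necessarily the cause, but the single-process examples in the Cortex repo keep these three directories separate; a minimal sketch (the exact paths here are illustrative assumptions):

  blocks_storage:
    tsdb:
      dir: /data/tsdb             # ingester's local TSDB
      retention_period: 168h
    bucket_store:
      sync_dir: /data/tsdb-sync   # store-gateway sync dir
    backend: filesystem
    filesystem:
      dir: /data/bucket           # long-term block storage (the "bucket")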

Prometheus remote write config:

additionalRemoteWrite: 
- url: http://ingest.abc.com/metrics/v1/push
  writeRelabelConfigs:
  - sourceLabels: [__name__]
    regex: '.*'
    action: 'replace'
    targetLabel: 'captain_domain'
    replacement: {{ .Values.captain_domain }}
  - sourceLabels: [__name__]
    regex: '.*'
    action: 'replace'
    targetLabel: 'abc_platform_version'
    replacement: {{ .Chart.Version }}
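
To separate a Cortex problem from a dashboard problem, the querier can also be hit directly over Cortex's Prometheus-compatible API (a sketch assuming the defaults implied by the config above: port 9009, the default /prometheus HTTP prefix, and no X-Scope-OrgID header since auth is disabled):

  # Instant query evaluated 24 hours in the past (GNU date)
  curl -G http://localhost:9009/prometheus/api/v1/query \
    --data-urlencode 'query=up' \
    --data-urlencode "time=$(date -d '24 hours ago' +%s)"

An empty result here, with blocks still present on disk, would point at the read path rather than deletion.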
danielblando commented 3 months ago

Do you know if the data is being deleted or just not being queried? Can you see blocks older than 5h on disk if you check /data?

Also, we should have some logs when deleting blocks:

msg="Deleting obsolete block" block=blockId

Can you see those logs? Is it possible to check how old the blocks being deleted are? You can try to look for other logs mentioning the blockId, or, with luck, still get the info from the blocks on disk.
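
A sketch of those checks (paths follow the config in this issue; block directories are ULID-named, and each holds a meta.json whose minTime/maxTime are Unix timestamps in milliseconds):

  # Find the deletion logs and the affected block IDs
  kubectl logs <cortex-pod> | grep 'Deleting obsolete block'

  # Locate block meta.json files anywhere under /data
  find /data -name meta.json

  # Inspect the time range a given block covers (jq for convenience)
  jq '{minTime, maxTime}' /data/<tenant>/<blockULID>/meta.json
  date -d @"$(jq '.minTime / 1000 | floor' /data/<tenant>/<blockULID>/meta.json)"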