cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Ingester SIGSEGV #2938

Closed amckinley closed 4 years ago

amckinley commented 4 years ago

Just observed this in our Cortex development cluster. I'm guessing this is an issue with the kernel version that's being distributed with the cortexproject/cortex:v1.2.0 docker image.

level=info ts=2020-07-27T19:03:39.824952985Z caller=head.go:709 org_id=fake msg="WAL replay completed" duration=10m28.498852973s
level=info ts=2020-07-27T19:03:46.763232189Z caller=db.go:1244 org_id=fake msg="Compactions disabled"
level=info ts=2020-07-27T19:03:46.763345449Z caller=ingester_v2.go:1014 msg="successfully opened existing TSDBs"
level=info ts=2020-07-27T19:03:46.763519662Z caller=cortex.go:319 msg="Cortex started"
level=info ts=2020-07-27T19:03:46.777600898Z caller=lifecycler.go:541 msg="existing entry found in ring" state=ACTIVE tokens=512 ring=ingester
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xf39efa]

goroutine 2335639 [running]:
github.com/prometheus/prometheus/tsdb.(*memSeries).iterator(0xc0340a6ee0, 0x0, 0xc2519f6e10, 0xc0002a0180, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/prometheus/prometheus/tsdb/head.go:2193 +0x84a
github.com/prometheus/prometheus/tsdb.(*safeChunk).Iterator(0xc2214b4c30, 0x0, 0x0, 0xc3a7c17101, 0xc0e166b5c0)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/prometheus/prometheus/tsdb/head.go:1513 +0x76
github.com/prometheus/prometheus/tsdb.(*chunkSeriesIterator).resetCurIterator(0xc0e166b5c0)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/prometheus/prometheus/tsdb/querier.go:1074 +0x6b
github.com/prometheus/prometheus/tsdb.newChunkSeriesIterator(...)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/prometheus/prometheus/tsdb/querier.go:1067
github.com/prometheus/prometheus/tsdb.(*chunkSeries).Iterator(0xc0e166b1a0, 0x0, 0x0)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/prometheus/prometheus/tsdb/querier.go:875 +0xf6
github.com/prometheus/prometheus/tsdb.(*chainedSeriesIterator).Next(0xc2214b4c60, 0x17391006b8d)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/prometheus/prometheus/tsdb/querier.go:936 +0xaa
github.com/cortexproject/cortex/pkg/ingester.(*Ingester).v2QueryStream(0xc000ab9000, 0xc2519f6690, 0x3e17a60, 0xc015afb000, 0x0, 0x0)
    /go/src/github.com/cortexproject/cortex/pkg/ingester/ingester_v2.go:744 +0x3b7
github.com/cortexproject/cortex/pkg/ingester.(*Ingester).QueryStream(0xc000ab9000, 0xc2519f6690, 0x3e17a60, 0xc015afb000, 0x0, 0x0)
    /go/src/github.com/cortexproject/cortex/pkg/ingester/ingester.go:691 +0x950
github.com/cortexproject/cortex/pkg/ingester/client._Ingester_QueryStream_Handler(0x36178a0, 0xc000ab9000, 0x3e10620, 0xc251fe7300, 0xc088728300, 0x1c)
    /go/src/github.com/cortexproject/cortex/pkg/ingester/client/cortex.pb.go:3370 +0x109
github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.StreamServerInterceptor.func1(0x36178a0, 0xc000ab9000, 0x3e10620, 0xc251fe7300, 0xc251fe7180, 0x382e0b8, 0x3de7400, 0xc2519f6600)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing/server_interceptors.go:47 +0x144
github.com/thanos-io/thanos/pkg/tracing.StreamServerInterceptor.func1(0x36178a0, 0xc000ab9000, 0x3e123c0, 0xc251fe72e0, 0xc251fe7180, 0x382e0b8, 0xc251fe72e0, 0x33be920)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/thanos-io/thanos/pkg/tracing/grpc.go:42 +0x134
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1(0x36178a0, 0xc000ab9000, 0x3e123c0, 0xc251fe72e0, 0x2e1f720, 0xc015afafa0)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:49 +0x5f
github.com/cortexproject/cortex/pkg/cortex.glob..func3(0x36178a0, 0xc000ab9000, 0x3e10860, 0xc251fe72c0, 0xc251fe7180, 0xc251fe71a0, 0x3de7401, 0xc251fe72c0)
    /go/src/github.com/cortexproject/cortex/pkg/cortex/fake_auth.go:29 +0x136
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1(0x36178a0, 0xc000ab9000, 0x3e10860, 0xc251fe72c0, 0x3de7400, 0xc2519f65a0)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:49 +0x5f
github.com/opentracing-contrib/go-grpc.OpenTracingStreamServerInterceptor.func1(0x36178a0, 0xc000ab9000, 0x3e10e60, 0xc261340900, 0xc251fe7180, 0xc251fe71c0, 0x0, 0x0)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/opentracing-contrib/go-grpc/server.go:114 +0x34a
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1(0x36178a0, 0xc000ab9000, 0x3e10e60, 0xc261340900, 0x3509a20, 0xc389ca3240)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:49 +0x5f
github.com/weaveworks/common/middleware.StreamServerInstrumentInterceptor.func1(0x36178a0, 0xc000ab9000, 0x3e10e60, 0xc261340900, 0xc251fe7180, 0xc251fe71e0, 0x2e970860, 0xc2ee80726c844)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/weaveworks/common/middleware/grpc_instrumentation.go:42 +0x89
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1(0x36178a0, 0xc000ab9000, 0x3e10e60, 0xc261340900, 0x203094, 0xc389ca3240)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:49 +0x5f
github.com/weaveworks/common/middleware.GRPCServerLog.StreamServerInterceptor(0x3e2c7a0, 0xc000297610, 0x40c700, 0x36178a0, 0xc000ab9000, 0x3e10e60, 0xc261340900, 0xc251fe7180, 0xc251fe7200, 0xc251fe7180, ...)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/weaveworks/common/middleware/grpc_logging.go:49 +0x98
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1(0x36178a0, 0xc000ab9000, 0x3e10e60, 0xc261340900, 0xc20a6b0c68, 0x40cfa8)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:49 +0x5f
github.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1(0x36178a0, 0xc000ab9000, 0x3e10e60, 0xc261340900, 0xc251fe7180, 0x382e0b8, 0x3de7400, 0xc251551ec0)
    /go/src/github.com/cortexproject/cortex/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:58 +0xcf
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000488ea0, 0x3e267c0, 0xc289978d80, 0xc2513b0300, 0xc000636ff0, 0x59ed4a0, 0x0, 0x0, 0x0)
    /go/src/github.com/cortexproject/cortex/vendor/google.golang.org/grpc/server.go:1336 +0x511
google.golang.org/grpc.(*Server).handleStream(0xc000488ea0, 0x3e267c0, 0xc289978d80, 0xc2513b0300, 0x0)
    /go/src/github.com/cortexproject/cortex/vendor/google.golang.org/grpc/server.go:1409 +0xc66
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc2227dd1d0, 0xc000488ea0, 0x3e267c0, 0xc289978d80, 0xc2513b0300)
    /go/src/github.com/cortexproject/cortex/vendor/google.golang.org/grpc/server.go:746 +0xa1
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/src/github.com/cortexproject/cortex/vendor/google.golang.org/grpc/server.go:744 +0xa1
runtime: note: your Linux kernel may be buggy
runtime: note: see https://golang.org/wiki/LinuxKernelSignalVectorBug
runtime: note: mlock workaround for kernel bug failed with errno 12

Running Cortex v1.2.0 on Kubernetes, as created by Grafana jsonnet libs. Full cortex config:

target: query-frontend
auth_enabled: false
http_prefix: /api/prom
api:
  alertmanager_http_prefix: /alertmanager
  prometheus_http_prefix: /prometheus
server:
  http_listen_address: ""
  http_listen_port: 80
  http_listen_conn_limit: 0
  grpc_listen_address: ""
  grpc_listen_port: 9095
  grpc_listen_conn_limit: 0
  http_tls_config:
    cert_file: ""
    key_file: ""
    client_auth_type: ""
    client_ca_file: ""
  grpc_tls_config:
    cert_file: ""
    key_file: ""
    client_auth_type: ""
    client_ca_file: ""
  register_instrumentation: true
  graceful_shutdown_timeout: 30s
  http_server_read_timeout: 30s
  http_server_write_timeout: 1m0s
  http_server_idle_timeout: 2m0s
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 4194304
  grpc_server_max_concurrent_streams: 100
  grpc_server_max_connection_idle: 2562047h47m16.854775807s
  grpc_server_max_connection_age: 2562047h47m16.854775807s
  grpc_server_max_connection_age_grace: 2562047h47m16.854775807s
  grpc_server_keepalive_time: 2h0m0s
  grpc_server_keepalive_timeout: 20s
  log_level: debug
  http_path_prefix: ""
distributor:
  pool:
    client_cleanup_period: 15s
    health_check_ingesters: true
  ha_tracker:
    enable_ha_tracker: false
    ha_tracker_update_timeout: 15s
    ha_tracker_update_timeout_jitter_max: 5s
    ha_tracker_failover_timeout: 30s
    kvstore:
      store: consul
      prefix: ha-tracker/
      consul:
        host: localhost:8500
        acl_token: ""
        http_client_timeout: 20s
        consistent_reads: false
        watch_rate_limit: 1
        watch_burst_size: 1
      etcd:
        endpoints: []
        dial_timeout: 10s
        max_retries: 10
      multi:
        primary: ""
        secondary: ""
        mirror_enabled: false
        mirror_timeout: 2s
  max_recv_msg_size: 104857600
  remote_timeout: 2s
  extra_queue_delay: 0s
  shard_by_all_labels: false
  ring:
    kvstore:
      store: consul
      prefix: collectors/
      consul:
        host: localhost:8500
        acl_token: ""
        http_client_timeout: 20s
        consistent_reads: false
        watch_rate_limit: 1
        watch_burst_size: 1
      etcd:
        endpoints: []
        dial_timeout: 10s
        max_retries: 10
      multi:
        primary: ""
        secondary: ""
        mirror_enabled: false
        mirror_timeout: 2s
    heartbeat_period: 5s
    heartbeat_timeout: 1m0s
    instance_id: query-frontend-7fb499f75d-wrljr
    instance_interface_names:
    - eth0
    - en0
    instance_port: 0
    instance_addr: ""
querier:
  max_concurrent: 20
  timeout: 2m0s
  iterators: false
  batch_iterators: true
  ingester_streaming: true
  max_samples: 50000000
  query_ingesters_within: 0s
  query_store_after: 0s
  max_query_into_future: 10m0s
  default_evaluation_interval: 1m0s
  active_query_tracker_dir: ./active-query-tracker
  lookback_delta: 5m0s
  store_gateway_addresses: ""
  store_gateway_client:
    tls_cert_path: ""
    tls_key_path: ""
    tls_ca_path: ""
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 16777216
    use_gzip_compression: false
    rate_limit: 0
    rate_limit_burst: 0
    backoff_on_ratelimits: false
    backoff_config:
      min_period: 100ms
      max_period: 10s
      max_retries: 10
    tls_cert_path: ""
    tls_key_path: ""
    tls_ca_path: ""
ingester:
  walconfig:
    wal_enabled: false
    checkpoint_enabled: true
    recover_from_wal: false
    wal_dir: wal
    checkpoint_duration: 30m0s
  lifecycler:
    ring:
      kvstore:
        store: consul
        prefix: collectors/
        consul:
          host: localhost:8500
          acl_token: ""
          http_client_timeout: 20s
          consistent_reads: false
          watch_rate_limit: 1
          watch_burst_size: 1
        etcd:
          endpoints: []
          dial_timeout: 10s
          max_retries: 10
        multi:
          primary: ""
          secondary: ""
          mirror_enabled: false
          mirror_timeout: 2s
      heartbeat_timeout: 1m0s
      replication_factor: 3
    num_tokens: 128
    heartbeat_period: 5s
    observe_period: 0s
    join_after: 0s
    min_ready_duration: 1m0s
    interface_names:
    - eth0
    - en0
    final_sleep: 30s
    tokens_file_path: ""
    availability_zone: ""
    address: ""
    port: 0
    id: query-frontend-7fb499f75d-wrljr
  max_transfer_retries: 10
  flush_period: 1m0s
  retain_period: 5m0s
  max_chunk_idle_time: 5m0s
  max_stale_chunk_idle_time: 2m0s
  flush_op_timeout: 1m0s
  max_chunk_age: 12h0m0s
  chunk_age_jitter: 0s
  concurrent_flushes: 50
  spread_flushes: true
  metadata_retain_period: 10m0s
  rate_update_period: 15s
flusher:
  wal_dir: wal
  concurrent_flushes: 50
  flush_op_timeout: 2m0s
storage:
  engine: chunks
  aws:
    dynamodb:
      dynamodb_url: ""
      api_limit: 2
      throttle_limit: 10
      metrics:
        url: ""
        target_queue_length: 100000
        scale_up_factor: 1.3
        ignore_throttle_below: 1
        queue_length_query: sum(avg_over_time(cortex_ingester_flush_queue_length{job="cortex/ingester"}[2m]))
        write_throttle_query: sum(rate(cortex_dynamo_throttled_total{operation="DynamoDB.BatchWriteItem"}[1m])) by (table) > 0
        write_usage_query: sum(rate(cortex_dynamo_consumed_capacity_total{operation="DynamoDB.BatchWriteItem"}[15m])) by (table) > 0
        read_usage_query: sum(rate(cortex_dynamo_consumed_capacity_total{operation="DynamoDB.QueryPages"}[1h])) by (table) > 0
        read_error_query: sum(increase(cortex_dynamo_failures_total{operation="DynamoDB.QueryPages",error="ProvisionedThroughputExceededException"}[1m])) by (table) > 0
      chunk_gang_size: 10
      chunk_get_max_parallelism: 32
    s3: ""
    bucketnames: ""
    s3forcepathstyle: false
  azure:
    container_name: cortex
    account_name: ""
    account_key: ""
    download_buffer_size: 512000
    upload_buffer_size: 256000
    upload_buffer_count: 1
    request_timeout: 30s
    max_retries: 5
    min_retry_delay: 10ms
    max_retry_delay: 500ms
  bigtable:
    project: ""
    instance: ""
    grpc_client_config:
      max_recv_msg_size: 104857600
      max_send_msg_size: 16777216
      use_gzip_compression: false
      rate_limit: 0
      rate_limit_burst: 0
      backoff_on_ratelimits: false
      backoff_config:
        min_period: 100ms
        max_period: 10s
        max_retries: 10
    table_cache_enabled: true
    table_cache_expiration: 30m0s
  gcs:
    bucket_name: ""
    chunk_buffer_size: 0
    request_timeout: 0s
  cassandra:
    addresses: ""
    port: 9042
    keyspace: ""
    consistency: QUORUM
    replication_factor: 1
    disable_initial_host_lookup: false
    SSL: false
    host_verification: true
    CA_path: ""
    auth: false
    username: ""
    password: ""
    password_file: ""
    custom_authenticators: []
    timeout: 2s
    connect_timeout: 5s
    reconnect_interval: 1s
    max_retries: 0
    retry_max_backoff: 10s
    retry_min_backoff: 100ms
    query_concurrency: 0
    num_connections: 2
    convict_hosts_on_failure: true
    table_options: ""
  boltdb:
    directory: ""
  filesystem:
    directory: ""
  swift:
    auth_url: ""
    username: ""
    user_domain_name: ""
    user_domain_id: ""
    user_id: ""
    password: ""
    domain_id: ""
    domain_name: ""
    project_id: ""
    project_name: ""
    project_domain_id: ""
    project_domain_name: ""
    region_name: ""
    container_name: cortex
  index_cache_validity: 5m0s
  index_queries_cache_config:
    enable_fifocache: false
    default_validity: 0s
    background:
      writeback_goroutines: 10
      writeback_buffer: 10000
    memcached:
      expiration: 0s
      batch_size: 1024
      parallelism: 100
    memcached_client:
      host: ""
      service: memcached
      addresses: ""
      timeout: 100ms
      max_idle_conns: 16
      update_interval: 1m0s
      consistent_hash: true
    redis:
      endpoint: ""
      timeout: 100ms
      expiration: 0s
      max_idle_conns: 80
      max_active_conns: 0
      password: ""
      enable_tls: false
      idle_timeout: 0s
      wait_on_pool_exhaustion: false
      max_conn_lifetime: 0s
    fifocache:
      max_size_bytes: ""
      max_size_items: 0
      validity: 0s
      size: 0
    prefix: store.index-cache-read.
  delete_store:
    store: ""
    requests_table_name: delete_requests
    table_provisioning:
      enable_ondemand_throughput_mode: false
      provisioned_write_throughput: 1
      provisioned_read_throughput: 300
      write_scale:
        enabled: false
        role_arn: ""
        min_capacity: 3000
        max_capacity: 6000
        out_cooldown: 1800
        in_cooldown: 1800
        target: 80
      read_scale:
        enabled: false
        role_arn: ""
        min_capacity: 3000
        max_capacity: 6000
        out_cooldown: 1800
        in_cooldown: 1800
        target: 80
      tags: {}
  grpc_store: {}
chunk_store:
  chunk_cache_config:
    enable_fifocache: false
    default_validity: 0s
    background:
      writeback_goroutines: 10
      writeback_buffer: 10000
    memcached:
      expiration: 0s
      batch_size: 1024
      parallelism: 100
    memcached_client:
      host: ""
      service: memcached
      addresses: ""
      timeout: 100ms
      max_idle_conns: 16
      update_interval: 1m0s
      consistent_hash: true
    redis:
      endpoint: ""
      timeout: 100ms
      expiration: 0s
      max_idle_conns: 80
      max_active_conns: 0
      password: ""
      enable_tls: false
      idle_timeout: 0s
      wait_on_pool_exhaustion: false
      max_conn_lifetime: 0s
    fifocache:
      max_size_bytes: ""
      max_size_items: 0
      validity: 0s
      size: 0
    prefix: store.chunks-cache.
  write_dedupe_cache_config:
    enable_fifocache: false
    default_validity: 0s
    background:
      writeback_goroutines: 10
      writeback_buffer: 10000
    memcached:
      expiration: 0s
      batch_size: 1024
      parallelism: 100
    memcached_client:
      host: ""
      service: memcached
      addresses: ""
      timeout: 100ms
      max_idle_conns: 16
      update_interval: 1m0s
      consistent_hash: true
    redis:
      endpoint: ""
      timeout: 100ms
      expiration: 0s
      max_idle_conns: 80
      max_active_conns: 0
      password: ""
      enable_tls: false
      idle_timeout: 0s
      wait_on_pool_exhaustion: false
      max_conn_lifetime: 0s
    fifocache:
      max_size_bytes: ""
      max_size_items: 0
      validity: 0s
      size: 0
    prefix: store.index-cache-write.
  cache_lookups_older_than: 0s
  max_look_back_period: 0s
schema:
  configs: []
limits:
  ingestion_rate: 25000
  ingestion_rate_strategy: local
  ingestion_burst_size: 50000
  accept_ha_samples: false
  ha_cluster_label: cluster
  ha_replica_label: __replica__
  drop_labels: []
  max_label_name_length: 1024
  max_label_value_length: 2048
  max_label_names_per_series: 30
  max_metadata_length: 1024
  reject_old_samples: false
  reject_old_samples_max_age: 336h0m0s
  creation_grace_period: 10m0s
  enforce_metadata_metric_name: true
  enforce_metric_name: true
  user_subring_size: 0
  max_series_per_query: 100000
  max_samples_per_query: 1000000
  max_series_per_user: 5000000
  max_series_per_metric: 50000
  max_global_series_per_user: 0
  max_global_series_per_metric: 0
  min_chunk_length: 0
  max_metadata_per_user: 8000
  max_metadata_per_metric: 10
  max_global_metadata_per_user: 0
  max_global_metadata_per_metric: 0
  max_chunks_per_query: 2000000
  max_query_length: 12000h0m0s
  max_query_parallelism: 14
  cardinality_limit: 100000
  max_cache_freshness: 10m0s
  per_tenant_override_config: /etc/cortex/overrides.yaml
  per_tenant_override_period: 10s
prealloc: {}
frontend_worker:
  frontend_address: ""
  parallelism: 10
  match_max_concurrent: false
  dns_lookup_duration: 10s
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 16777216
    use_gzip_compression: false
    rate_limit: 0
    rate_limit_burst: 0
    backoff_on_ratelimits: false
    backoff_config:
      min_period: 100ms
      max_period: 10s
      max_retries: 10
    tls_cert_path: ""
    tls_key_path: ""
    tls_ca_path: ""
frontend:
  max_outstanding_per_tenant: 100
  compress_responses: true
  downstream_url: ""
  log_queries_longer_than: 0s
query_range:
  split_queries_by_interval: 24h0m0s
  split_queries_by_day: false
  align_queries_with_step: true
  results_cache:
    cache:
      enable_fifocache: false
      default_validity: 0s
      background:
        writeback_goroutines: 10
        writeback_buffer: 10000
      memcached:
        expiration: 0s
        batch_size: 1024
        parallelism: 100
      memcached_client:
        host: memcached-frontend.cortex-tsdb.svc.cluster.local
        service: memcached-client
        addresses: ""
        timeout: 500ms
        max_idle_conns: 16
        update_interval: 1m0s
        consistent_hash: true
      redis:
        endpoint: ""
        timeout: 100ms
        expiration: 0s
        max_idle_conns: 80
        max_active_conns: 0
        password: ""
        enable_tls: false
        idle_timeout: 0s
        wait_on_pool_exhaustion: false
        max_conn_lifetime: 0s
      fifocache:
        max_size_bytes: ""
        max_size_items: 0
        validity: 0s
        size: 0
      prefix: frontend.
    max_freshness: 0s
  cache_results: true
  max_retries: 5
  parallelise_shardable_queries: false
table_manager:
  throughput_updates_disabled: false
  retention_deletes_enabled: false
  retention_period: 0s
  poll_interval: 2m0s
  creation_grace_period: 10m0s
  index_tables_provisioning:
    enable_ondemand_throughput_mode: false
    provisioned_write_throughput: 1000
    provisioned_read_throughput: 300
    write_scale:
      enabled: false
      role_arn: ""
      min_capacity: 3000
      max_capacity: 6000
      out_cooldown: 1800
      in_cooldown: 1800
      target: 80
    read_scale:
      enabled: false
      role_arn: ""
      min_capacity: 3000
      max_capacity: 6000
      out_cooldown: 1800
      in_cooldown: 1800
      target: 80
    enable_inactive_throughput_on_demand_mode: false
    inactive_write_throughput: 1
    inactive_read_throughput: 300
    inactive_write_scale:
      enabled: false
      role_arn: ""
      min_capacity: 3000
      max_capacity: 6000
      out_cooldown: 1800
      in_cooldown: 1800
      target: 80
    inactive_read_scale:
      enabled: false
      role_arn: ""
      min_capacity: 3000
      max_capacity: 6000
      out_cooldown: 1800
      in_cooldown: 1800
      target: 80
    inactive_write_scale_lastn: 4
    inactive_read_scale_lastn: 4
  chunk_tables_provisioning:
    enable_ondemand_throughput_mode: false
    provisioned_write_throughput: 1000
    provisioned_read_throughput: 300
    write_scale:
      enabled: false
      role_arn: ""
      min_capacity: 3000
      max_capacity: 6000
      out_cooldown: 1800
      in_cooldown: 1800
      target: 80
    read_scale:
      enabled: false
      role_arn: ""
      min_capacity: 3000
      max_capacity: 6000
      out_cooldown: 1800
      in_cooldown: 1800
      target: 80
    enable_inactive_throughput_on_demand_mode: false
    inactive_write_throughput: 1
    inactive_read_throughput: 300
    inactive_write_scale:
      enabled: false
      role_arn: ""
      min_capacity: 3000
      max_capacity: 6000
      out_cooldown: 1800
      in_cooldown: 1800
      target: 80
    inactive_read_scale:
      enabled: false
      role_arn: ""
      min_capacity: 3000
      max_capacity: 6000
      out_cooldown: 1800
      in_cooldown: 1800
      target: 80
    inactive_write_scale_lastn: 4
    inactive_read_scale_lastn: 4
tsdb:
  dir: tsdb
  block_ranges_period:
  - 2h0m0s
  retention_period: 6h0m0s
  ship_interval: 1m0s
  ship_concurrency: 10
  backend: s3
  bucket_store:
    sync_dir: tsdb-sync
    sync_interval: 5m0s
    max_chunk_pool_bytes: 2147483648
    max_sample_count: 0
    max_concurrent: 20
    tenant_sync_concurrency: 10
    block_sync_concurrency: 20
    meta_sync_concurrency: 20
    consistency_delay: 0s
    index_cache:
      backend: inmemory
      inmemory:
        max_size_bytes: 1073741824
      memcached:
        addresses: ""
        timeout: 100ms
        max_idle_connections: 16
        max_async_concurrency: 50
        max_async_buffer_size: 10000
        max_get_multi_concurrency: 100
        max_get_multi_batch_size: 0
        max_item_size: 1048576
      postings_compression_enabled: false
    chunks_cache:
      backend: ""
      memcached:
        addresses: ""
        timeout: 100ms
        max_idle_connections: 16
        max_async_concurrency: 50
        max_async_buffer_size: 10000
        max_get_multi_concurrency: 100
        max_get_multi_batch_size: 0
        max_item_size: 1048576
      subrange_size: 16000
      max_get_range_requests: 3
      attributes_ttl: 24h0m0s
      subrange_ttl: 24h0m0s
    metadata_cache:
      backend: ""
      memcached:
        addresses: ""
        timeout: 100ms
        max_idle_connections: 16
        max_async_concurrency: 50
        max_async_buffer_size: 10000
        max_get_multi_concurrency: 100
        max_get_multi_batch_size: 0
        max_item_size: 1048576
      tenants_list_ttl: 15m0s
      tenant_blocks_list_ttl: 15m0s
      chunks_list_ttl: 24h0m0s
      metafile_exists_ttl: 2h0m0s
      metafile_doesnt_exist_ttl: 15m0s
      metafile_content_ttl: 24h0m0s
      metafile_max_size_bytes: 1048576
    ignore_deletion_mark_delay: 6h0m0s
    postings_offsets_in_mem_sampling: 32
  head_compaction_interval: 1m0s
  head_compaction_concurrency: 5
  stripe_size: 16384
  wal_compression_enabled: false
  store_gateway_enabled: false
  max_tsdb_opening_concurrency_on_startup: 10
  s3:
    endpoint: ""
    bucket_name: ""
    secret_access_key: ""
    access_key_id: ""
    insecure: false
  gcs:
    bucket_name: ""
    service_account: ""
  azure:
    account_name: ""
    account_key: ""
    container_name: ""
    endpoint_suffix: ""
    max_retries: 20
  filesystem:
    dir: ""
compactor:
  block_ranges:
  - 2h0m0s
  - 12h0m0s
  - 24h0m0s
  block_sync_concurrency: 20
  meta_sync_concurrency: 20
  consistency_delay: 0s
  data_dir: ./data
  compaction_interval: 1h0m0s
  compaction_retries: 3
  compaction_concurrency: 1
  deletion_delay: 12h0m0s
  sharding_enabled: false
  sharding_ring:
    kvstore:
      store: consul
      prefix: collectors/
      consul:
        host: localhost:8500
        acl_token: ""
        http_client_timeout: 20s
        consistent_reads: false
        watch_rate_limit: 1
        watch_burst_size: 1
      etcd:
        endpoints: []
        dial_timeout: 10s
        max_retries: 10
      multi:
        primary: ""
        secondary: ""
        mirror_enabled: false
        mirror_timeout: 2s
    heartbeat_period: 5s
    heartbeat_timeout: 1m0s
    instance_id: query-frontend-7fb499f75d-wrljr
    instance_interface_names:
    - eth0
    - en0
    instance_port: 0
    instance_addr: ""
store_gateway:
  sharding_enabled: false
  sharding_ring:
    kvstore:
      store: consul
      prefix: collectors/
      consul:
        host: localhost:8500
        acl_token: ""
        http_client_timeout: 20s
        consistent_reads: false
        watch_rate_limit: 1
        watch_burst_size: 1
      etcd:
        endpoints: []
        dial_timeout: 10s
        max_retries: 10
      multi:
        primary: ""
        secondary: ""
        mirror_enabled: false
        mirror_timeout: 2s
    heartbeat_period: 15s
    heartbeat_timeout: 1m0s
    replication_factor: 3
    tokens_file_path: ""
    instance_id: query-frontend-7fb499f75d-wrljr
    instance_interface_names:
    - eth0
    - en0
    instance_port: 0
    instance_addr: ""
purger:
  enable: false
  num_workers: 2
  object_store_type: ""
  delete_request_cancel_period: 24h0m0s
ruler:
  external_url: ""
  ruler_client:
    tls_cert_path: ""
    tls_key_path: ""
    tls_ca_path: ""
  evaluation_interval: 1m0s
  evaluation_delay_duration: 0s
  poll_interval: 1m0s
  storage:
    type: configdb
    configdb:
      configs_api_url: ""
      client_timeout: 5s
      tls_cert_path: ""
      tls_key_path: ""
      tls_ca_path: ""
    azure:
      container_name: cortex
      account_name: ""
      account_key: ""
      download_buffer_size: 512000
      upload_buffer_size: 256000
      upload_buffer_count: 1
      request_timeout: 30s
      max_retries: 5
      min_retry_delay: 10ms
      max_retry_delay: 500ms
    gcs:
      bucket_name: ""
      chunk_buffer_size: 0
      request_timeout: 0s
    s3:
      s3: ""
      bucketnames: ""
      s3forcepathstyle: false
    swift:
      auth_url: ""
      username: ""
      user_domain_name: ""
      user_domain_id: ""
      user_id: ""
      password: ""
      domain_id: ""
      domain_name: ""
      project_id: ""
      project_name: ""
      project_domain_id: ""
      project_domain_name: ""
      region_name: ""
      container_name: cortex
  rule_path: /rules
  alertmanager_url: ""
  enable_alertmanager_discovery: false
  alertmanager_refresh_interval: 1m0s
  enable_alertmanager_v2: false
  notification_queue_capacity: 10000
  notification_timeout: 10s
  enable_sharding: false
  search_pending_for: 5m0s
  ring:
    kvstore:
      store: consul
      prefix: rulers/
      consul:
        host: localhost:8500
        acl_token: ""
        http_client_timeout: 20s
        consistent_reads: false
        watch_rate_limit: 1
        watch_burst_size: 1
      etcd:
        endpoints: []
        dial_timeout: 10s
        max_retries: 10
      multi:
        primary: ""
        secondary: ""
        mirror_enabled: false
        mirror_timeout: 2s
    heartbeat_period: 5s
    heartbeat_timeout: 1m0s
    instance_id: query-frontend-7fb499f75d-wrljr
    instance_interface_names:
    - eth0
    - en0
    instance_port: 0
    instance_addr: ""
    num_tokens: 128
  flush_period: 1m0s
  enable_api: false
configs:
  database:
    uri: postgres://postgres@configs-db.weave.local/configs?sslmode=disable
    migrations_dir: ""
    password_file: ""
  api:
    notifications:
      disable_email: false
      disable_webhook: false
alertmanager:
  data_dir: data/
  retention: 120h0m0s
  external_url: ""
  poll_interval: 15s
  cluster_bind_address: 0.0.0.0:9094
  cluster_advertise_address: ""
  peers: []
  peer_timeout: 15s
  fallback_config_file: ""
  auto_webhook_root: ""
  storage:
    type: configdb
    configdb:
      configs_api_url: ""
      client_timeout: 5s
      tls_cert_path: ""
      tls_key_path: ""
      tls_ca_path: ""
    local:
      path: ""
runtime_config:
  period: 10s
  file: /etc/cortex/overrides.yaml
memberlist:
  node_name: ""
  randomize_node_name: true
  stream_timeout: 0s
  retransmit_factor: 0
  pull_push_interval: 0s
  gossip_interval: 0s
  gossip_nodes: 0
  gossip_to_dead_nodes_time: 0s
  dead_node_reclaim_time: 0s
  join_members: []
  min_join_backoff: 1s
  max_join_backoff: 1m0s
  max_join_retries: 10
  abort_if_cluster_join_fails: true
  rejoin_interval: 0s
  left_ingesters_timeout: 5m0s
  leave_timeout: 5s
  bind_addr: []
  bind_port: 7946
  packet_dial_timeout: 5s
  packet_write_timeout: 5s

pstibrany commented 4 years ago

Thanks for the report. Kernel is not part of distributed docker images, and is completely in your control.

Based on https://github.com/golang/go/wiki/LinuxKernelSignalVectorBug, increasing the amount of memory that the program can lock (ulimit -l) may resolve the problem. Another option is to set GODEBUG=asyncpreemptoff=1, which does not fix the problem but only makes it less likely to occur.
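To confirm that a process was actually started with that setting, one option is to inspect its GODEBUG environment variable, which is a comma-separated list of name=value pairs. The following is a minimal sketch; the function name asyncPreemptDisabled is made up for illustration and is not part of Cortex or the Go runtime.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// asyncPreemptDisabled reports whether a GODEBUG value (a comma-separated
// list of name=value pairs) contains asyncpreemptoff=1.
// The helper name is hypothetical; it only inspects the string it is given.
func asyncPreemptDisabled(godebug string) bool {
	for _, kv := range strings.Split(godebug, ",") {
		if strings.TrimSpace(kv) == "asyncpreemptoff=1" {
			return true
		}
	}
	return false
}

func main() {
	// Reads the environment of the current process only.
	fmt.Println("async preemption disabled:", asyncPreemptDisabled(os.Getenv("GODEBUG")))
}
```

Running it as `GODEBUG=asyncpreemptoff=1 go run .` exercises the positive case; without the variable set, the check reports false.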

If possible, updating the Linux kernel may also be a solution.

The bug was fixed in Linux kernel versions 5.3.15, 5.4.2, and 5.5 and later.
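As a rough illustration of the version check described above, the sketch below compares a kernel version string against the three fixed releases listed (5.3.15, 5.4.2, and 5.5 or later). The function names parseKernelVersion and hasFix are hypothetical, and the check only answers "is this version at or past a fixed release"; a false result does not by itself mean the kernel is affected.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKernelVersion extracts major, minor and patch numbers from strings
// such as "5.3.15-1" or "5.3.0-3-amd64". Everything after the leading run of
// digits and dots is ignored; missing components default to 0.
func parseKernelVersion(s string) (major, minor, patch int, err error) {
	end := strings.IndexFunc(s, func(r rune) bool {
		return r != '.' && (r < '0' || r > '9')
	})
	if end >= 0 {
		s = s[:end]
	}
	parts := strings.Split(s, ".")
	var nums [3]int
	for i := 0; i < len(parts) && i < 3; i++ {
		nums[i], err = strconv.Atoi(parts[i])
		if err != nil {
			return 0, 0, 0, fmt.Errorf("invalid kernel version component %q: %w", parts[i], err)
		}
	}
	return nums[0], nums[1], nums[2], nil
}

// hasFix reports whether the version is at or past one of the fixed
// releases named in the thread: 5.3.15, 5.4.2, or 5.5 and later.
func hasFix(major, minor, patch int) bool {
	switch {
	case major > 5:
		return true
	case major < 5:
		return false
	case minor >= 5:
		return true
	case minor == 4:
		return patch >= 2
	case minor == 3:
		return patch >= 15
	default:
		return false
	}
}

func main() {
	// Example inputs only; substitute the version strings from your own nodes.
	for _, v := range []string{"5.3.15-1", "5.3.0-3-amd64", "5.4.2"} {
		major, minor, patch, err := parseKernelVersion(v)
		if err != nil {
			fmt.Println(v, "->", err)
			continue
		}
		fmt.Printf("%s -> fixed release or later: %v\n", v, hasFix(major, minor, patch))
	}
}
```

Note that which string is fed in matters: a distribution's release string (for example 5.3.0-3-amd64) and its package version (for example 5.3.15-1) parse to different patch levels, so the same host can pass or fail this check depending on the source of the version.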

amckinley commented 4 years ago

Kernel is not part of distributed docker images, and is completely in your control.

Oh duh; my mistake. We'll look into upgrading the kernel on the k8s nodes hosting our Cortex pods, or try your other suggestions if we can't accomplish that. Assuming there's nothing else to look at on your end, feel free to close this out.

amckinley commented 4 years ago

Actually, I'm now convinced that the message at the bottom of the stack trace is unrelated. We're running a kernel version that shouldn't have this issue: 5.3.0-3-amd64 #1 SMP Debian 5.3.15-1 (2019-12-07) x86_64 GNU/Linux

I think this is actually a bug in the Prometheus version that Cortex is using, which was fixed by this commit: https://github.com/prometheus/prometheus/commit/d30492cbb0ec781811e9cbc1c7fb4603b3e33606#diff-6fb13507bfadf9819fea5bda61f599e6

Is it difficult to update Cortex's dependency on Prometheus? I think you at least need to cherry-pick that^^ one fix.

pstibrany commented 4 years ago

That commit fixes the case when the WAL is corrupted and some WAL segments have been deleted as a result. You should see a log message about that.

Cortex tracks Prometheus master quite closely, ~so this bugfix will be part of Cortex very soon. (Our last Prometheus update to latest master was 5 days ago, bugfix got merged the day after)~

Actually, this bugfix is already in Cortex master.

pracucci commented 4 years ago

I guess we can consider this issue closed by https://github.com/cortexproject/cortex/pull/2902, but please feel free to re-open it if that's not the case. Thanks!

uruddarraju commented 4 years ago

Does Cortex usually cherry-pick important fixes like this into the releases, or do we just have to wait for another release to be cut? (I'd prefer not deploying off master here.)

The fix for this issue (which definitely seems to be the Prometheus commit linked above) is not present in 1.2.0 and is only available on master.

pstibrany commented 4 years ago

Does Cortex usually cherry-pick important fixes like this into the releases, or do we just have to wait for another release to be cut? (I'd prefer not deploying off master here.)

Cortex doesn’t cherry-pick bugfixes for experimental features into existing release branches.

The release process for Cortex 1.3.0 should start sometime next week, with the final release likely the week after.

pstibrany commented 4 years ago

Note that there is a way to get rid of this problem: remove the chunks_head directory inside the ingester, in the per-tenant TSDB directory. It is safe to delete this directory while the ingester is NOT running. No data is lost in the process.