cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Calls to bucket storage on GCS fail after upgrading to 1.16.0 #5824

Open sivadeepN opened 4 months ago

sivadeepN commented 4 months ago

Describe the bug

Calls to bucket storage are failing from different components after upgrading to 1.16.0 from 1.14.1. I didn't find any extra configuration in the latest versions. Can someone help here? I am pasting some logs in the thread.

ts=2024-03-20T08:45:51.503282442Z caller=cortex.go:444 level=error msg="module failed" module=alertmanager err="invalid service state: Failed, expected: Running, failure: failed to load alertmanager configurations for owned users: failed to fetch alertmanager config for user 01aa839c-559e-450e-a852-bfe01aac701c: Get \"https://storage.googleapis.com/test_alertmanager_api/alerts/01aa839c-559e-450e-a852-bfe01aac701c\": http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""

To Reproduce

  1. Upgrade cortex to 1.16.0 from 1.14.3

Expected behavior

Cortex should keep working as before.

Environment:

Additional Context

Compactor : ts=2024-03-14T08:14:16.378140292Z caller=compactor.go:644 level=error component=compactor msg="failed to discover users from bucket" err="Get \"https://storage.googleapis.com/storage/v1/b/test_blocks_integration/o?alt=json&delimiter=%2F&endOffset=&fields=nextPageToken%2Cprefixes%2Citems%28name%29&includeTrailingDelimiter=false&pageToken=&prefix=&prettyPrint=false&projection=full&startOffset=&versions=false\": http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""

ingester : [cortex-aggregator-v2-ingester-1 ingester] ts=2024-03-20T09:26:17.038736639Z caller=ingester.go:2321 level=warn msg="shipper failed to synchronize TSDB blocks with the storage" user=23835d6d-b09c-49bc-a6af-5ff74270da4c uploaded=0 err="check exists: Get \"https://storage.googleapis.com/storage/v1/b/test_blocks_integration/o/23835d6d-b09c-49bc-a6af-5ff74270da4c%2F01HSDHF39NA9ARYR4AWTW8BV2R%2Fmeta.json?alt=json&prettyPrint=false&projection=full\": http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""
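The three log lines above come from different components but appear to end with the same HTTP/2 failure. A minimal sketch for checking that across a larger log dump, assuming logfmt-style lines like the ones pasted here: the function names `goaway_signature` and `summarize` are hypothetical helpers, and the sample lines are trimmed copies of the logs in this report (URLs replaced with `...`).

```python
import re
from collections import Counter

# Matches the GOAWAY details at the end of the error message.
# Quotes nested inside err="..." are backslash-escaped in the logs,
# so the backslashes around the debug value are treated as optional.
GOAWAY_RE = re.compile(
    r'server sent GOAWAY and closed the connection; '
    r'LastStreamID=(?P<last_stream_id>\d+), '
    r'ErrCode=(?P<err_code>\w+), '
    r'debug=\\?"(?P<debug>[^"\\]*)\\?"'
)
# Matches the source file in the caller=file.go:line field.
CALLER_RE = re.compile(r'\bcaller=(?P<file>[\w.\-/]+):(?P<line>\d+)')


def goaway_signature(log_line):
    """Return the caller and GOAWAY fields of one log line, or None if absent."""
    goaway = GOAWAY_RE.search(log_line)
    if goaway is None:
        return None
    caller = CALLER_RE.search(log_line)
    return {
        "caller": caller.group("file") if caller else None,
        "last_stream_id": int(goaway.group("last_stream_id")),
        "err_code": goaway.group("err_code"),
        "debug": goaway.group("debug"),
    }


def summarize(lines):
    """Count how many lines share each (ErrCode, debug) pair."""
    signatures = (goaway_signature(line) for line in lines)
    return Counter((s["err_code"], s["debug"]) for s in signatures if s is not None)


# Trimmed copies of the log lines from this report; "..." marks shortened parts.
SAMPLE_LINES = [
    r'ts=2024-03-20T08:45:51.503282442Z caller=cortex.go:444 level=error msg="module failed" module=alertmanager err="... http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""',
    r'ts=2024-03-14T08:14:16.378140292Z caller=compactor.go:644 level=error component=compactor msg="failed to discover users from bucket" err="... http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""',
    r'ts=2024-03-20T09:26:17.038736639Z caller=ingester.go:2321 level=warn msg="shipper failed to synchronize TSDB blocks with the storage" err="check exists: ... http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=COMPRESSION_ERROR, debug=\"hpack_truncated_block\""',
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        print(goaway_signature(line))
    print(summarize(SAMPLE_LINES))
```

If every failing component reports the same `ErrCode` and `debug` pair, that would suggest the problem sits in the shared HTTP/2 path to GCS rather than in any single Cortex module, though this script alone cannot confirm the cause.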

alertmanager:
  cluster:
    listen_address: 0.0.0.0:9094
    peers: cortex-aggregator-v2-alertmanager-http-metrics-headless.cortex-2.svc.cluster.local:9094
  data_dir: /data
  enable_api: true
  external_url: /api/prom/alertmanager
alertmanager_storage:
  backend: gcs
  gcs:
    bucket_name: test_alertmanager_api
api:
  prometheus_http_prefix: /prometheus
  response_compression_enabled: true
auth_enabled: true
blocks_storage:
  backend: gcs
  bucket_store:
    bucket_index:
      enabled: true
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dns+cortex-aggregator-v2-memcached-blocks-${POD_ZONE:a}.cortex-2.svc.cluster.local:11211
        max_async_buffer_size: 500000
        max_async_concurrency: 500
        max_get_multi_batch_size: 500
        max_get_multi_concurrency: 1000
        max_idle_connections: 500
        timeout: 15s
    index_cache:
      backend: memcached
      memcached:
        addresses: dns+cortex-aggregator-v2-memcached-blocks-index-${POD_ZONE:a}.cortex-2.svc.cluster.local:11211
        max_async_buffer_size: 500000
        max_async_concurrency: 500
        max_get_multi_batch_size: 500
        max_get_multi_concurrency: 1000
        max_idle_connections: 500
        max_item_size: 10485760
        timeout: 15s
    metadata_cache:
      backend: memcached
      memcached:
        addresses: dns+cortex-aggregator-v2-memcached-blocks-metadata.cortex-2.svc.cluster.local:11211
    sync_dir: /data/tsdb-sync
  gcs:
    bucket_name: test_blocks_integration
  tsdb:
    dir: /data/tsdb
    max_exemplars: 10000
    retention_period: 6h
compactor:
  block_deletion_marks_migration_enabled: false
  sharding_enabled: true
  sharding_ring:
    kvstore:
      consul:
        host: consul:8500
      store: consul
distributor:
  pool:
    health_check_ingesters: true
  remote_timeout: 2s
  ring:
    kvstore:
      store: memberlist
  shard_by_all_labels: true
frontend:
  grpc_client_config:
    grpc_compression: gzip
  log_queries_longer_than: 10s
  max_outstanding_per_tenant: 500
frontend_worker:
  frontend_address: cortex-aggregator-v2-query-frontend-headless:9095
  grpc_client_config:
    backoff_config:
      max_period: 10s
      max_retries: 2
      min_period: 100ms
    grpc_compression: gzip
ingester:
  lifecycler:
    availability_zone: ${POD_ZONE}
    final_sleep: 30s
    heartbeat_period: 15s
    join_after: 30s
    num_tokens: 256
    observe_period: 10s
    ring:
      heartbeat_timeout: 1m
      kvstore:
        consul:
          consistent_reads: false
          host: consul:8500
          http_client_timeout: 20s
        prefix: collectors/
        store: consul
      replication_factor: 3
      zone_awareness_enabled: true
    tokens_file_path: /data/tokens
ingester_client:
  grpc_client_config:
    grpc_compression: gzip
    max_recv_msg_size: 104857600
    max_send_msg_size: 16777216
limits:
  enforce_metric_name: true
  ingestion_burst_size: 50000
  ingestion_rate: 350000
  ingestion_rate_strategy: global
  max_cache_freshness: 5m
  max_fetched_chunks_per_query: 2000000
  max_fetched_series_per_query: 100000
  max_global_series_per_user: 5600000
  max_label_name_length: 1024
  max_label_names_per_series: 100
  max_query_lookback: 4536h
  max_series_per_metric: 50000
  max_series_per_user: 0
  reject_old_samples: false
  reject_old_samples_max_age: 168h
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:

yeya24 commented 4 months ago

Hey @sivadeepN, personally I don't use GCS, so I am unsure about this error, but we do have other users on GCS who experience no issues using the same GCS bucket client.

I am wondering if it is because the 1.16 release still uses an old version of the GCS client library. Can you try the latest master image on one of your containers and see if the error is still there?

quay.io/cortexproject/cortex:master-065e382

sivadeepN commented 4 months ago

Even updating to the master branch didn't work. This would be a production blocker for all users on GCS.