cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

querier: setting store gateway sharding-enabled is flaky when using cli args in 1.17.x #5967

Open paulcostinean opened 5 months ago

paulcostinean commented 5 months ago

Describe the bug

Upgraded to 1.17 in one setup with no issues, but the querier started crash-looping in a second instance. The order of the CLI arguments seems to have some side effects.

To Reproduce

Steps to reproduce the behavior:

  1. Start Cortex (1.17) querier with the following arguments:

        - -target=querier
        - -server.http-listen-port=80
        - -config.file=/etc/cortex/config.yaml
        - -blocks-storage.s3.insecure=true
        - -blocks-storage.bucket-store.bucket-index.enabled=true
        - -consul.hostname=<redacted>:8500
        - -blocks-storage.bucket-store.index-cache.memcached.addresses=<redacted>:11211
        - -blocks-storage.bucket-store.index-cache.memcached.timeout=200ms
        - -blocks-storage.bucket-store.index-cache.backend=memcached
        - -blocks-storage.bucket-store.chunks-cache.memcached.addresses=<redacted>:11211
        - -blocks-storage.bucket-store.chunks-cache.memcached.timeout=200ms
        - -blocks-storage.bucket-store.chunks-cache.backend=memcached
        - -blocks-storage.bucket-store.metadata-cache.memcached.addresses=<redacted>:11211
        - -blocks-storage.bucket-store.metadata-cache.memcached.timeout=200ms
        - -blocks-storage.bucket-store.metadata-cache.backend=memcached
        - -querier.frontend-address=<redacted>:9095
    [...]
        - -store-gateway.sharding-enabled
        - -store-gateway.sharding-ring.consul.hostname=<redacted>:8500
        - -store-gateway.sharding-ring.replication-factor=3
  2. The querier exits at startup with the following error:

ts=2024-05-22T10:45:09.408819897Z caller=main.go:199 level=info msg="Starting Cortex" version="(version=1.17.1, branch=HEAD, revision=62b2513)"
ts=2024-05-22T10:45:09.409043709Z caller=server.go:319 level=info http=[::]:80 grpc=[::]:9095 msg="server listening on addresses"
ts=2024-05-22T10:45:09.411642411Z caller=memcached.go:49 level=info msg="created memcached cache"
ts=2024-05-22T10:45:09.413174642Z caller=memcached.go:49 level=info msg="created memcached cache"
ts=2024-05-22T10:45:09.413355343Z caller=log.go:121 level=error msg="error running cortex" err="failed to initialize querier: no store-gateway address configured\nerror initialising module: store-queryable\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/__w/cortex/cortex/pkg/util/modules/modules.go:108\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/__w/cortex/cortex/pkg/util/modules/modules.go:78\ngithub.com/cortexproject/cortex/pkg/cortex.(*Cortex).Run\n\t/__w/cortex/cortex/pkg/cortex/cortex.go:410\nmain.main\n\t/__w/cortex/cortex/cmd/cortex/main.go:201\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"

Expected behavior

The querier should have started, and the store-gateway addresses should have been fetched from the sharding ring (Consul-based).

Reordering the store-gateway-specific arguments works around the problem.

From (broken):

        - -store-gateway.sharding-enabled
        - -store-gateway.sharding-ring.consul.hostname=<redacted>:8500
        - -store-gateway.sharding-ring.replication-factor=3

To (working):

        - -store-gateway.sharding-ring.replication-factor=3
        - -store-gateway.sharding-enabled
        - -store-gateway.sharding-ring.consul.hostname=<redacted>:8500

I think the struct used for sharding-ring is not initialised when setting the store-gateway.sharding-enabled parameter in the querier.


friedrichg commented 4 months ago

thanks for reporting. Such a weird bug.

alanprot commented 4 months ago

# Shard blocks across multiple store gateway instances. This option needs be set
# both on the store-gateway and querier when running in microservices mode.
# CLI flag: -store-gateway.sharding-enabled
[sharding_enabled: <boolean> | default = false]

I think this CLI flag should be passed as -store-gateway.sharding-enabled=true

paulcostinean commented 4 months ago

The short form works for toggling a BoolVar. I just tried the explicit form, and the issue is still reproducible with the following:

        - -store-gateway.sharding-enabled=true
        - -store-gateway.sharding-ring.consul.hostname=<redacted>:8500
        - -store-gateway.sharding-ring.replication-factor=3

yeya24 commented 2 months ago

Unfortunately, I cannot reproduce this issue on my end, and I don't see how changing the order of the CLI arguments would matter.
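
For anyone trying to narrow this down, here is a minimal sketch of a baseline check using only Go's standard `flag` package. It is not Cortex code: `ringConfig`, `parseArgs`, the zero-value defaults, and the `consul.example:8500` hostname are all hypothetical stand-ins. Only the three flag names are taken from this thread. It parses the broken ordering, the working ordering, and the explicit `=true` form, then prints the resulting struct for each.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// ringConfig is a hypothetical stand-in for the querier's store-gateway
// settings. It is NOT the actual Cortex config struct.
type ringConfig struct {
	ShardingEnabled   bool
	ConsulHostname    string
	ReplicationFactor int
}

// parseArgs registers the three flags from this issue on a fresh FlagSet
// and parses args into a new ringConfig. Defaults are zero values chosen
// for the sketch, not Cortex's real defaults.
func parseArgs(args []string) (ringConfig, error) {
	var cfg ringConfig
	fs := flag.NewFlagSet("querier", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the output on errors
	fs.BoolVar(&cfg.ShardingEnabled, "store-gateway.sharding-enabled", false, "")
	fs.StringVar(&cfg.ConsulHostname, "store-gateway.sharding-ring.consul.hostname", "", "")
	fs.IntVar(&cfg.ReplicationFactor, "store-gateway.sharding-ring.replication-factor", 0, "")
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	orderings := [][]string{
		// Ordering reported as broken.
		{
			"-store-gateway.sharding-enabled",
			"-store-gateway.sharding-ring.consul.hostname=consul.example:8500",
			"-store-gateway.sharding-ring.replication-factor=3",
		},
		// Ordering reported as working.
		{
			"-store-gateway.sharding-ring.replication-factor=3",
			"-store-gateway.sharding-enabled",
			"-store-gateway.sharding-ring.consul.hostname=consul.example:8500",
		},
		// Explicit =true form, in the "broken" order.
		{
			"-store-gateway.sharding-enabled=true",
			"-store-gateway.sharding-ring.consul.hostname=consul.example:8500",
			"-store-gateway.sharding-ring.replication-factor=3",
		},
	}
	for i, args := range orderings {
		cfg, err := parseArgs(args)
		if err != nil {
			fmt.Printf("ordering %d: parse error: %v\n", i, err)
			continue
		}
		fmt.Printf("ordering %d: %+v\n", i, cfg)
	}
}
```

With the standard library alone, all three orderings produce the same struct, since each flag writes to its own field and a bare boolean flag is equivalent to `=true`. If the real querier behaves differently, the cause is more likely in how the flags are registered or in how the config file and CLI values are combined than in flag parsing itself, though that remains a guess until someone can reproduce it.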