cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0
5.45k stars 791 forks source link

Ruler fails to start without helpful logs #3124

Closed amckinley closed 4 years ago

amckinley commented 4 years ago

This is very similar to https://github.com/cortexproject/cortex/issues/2991. In this case, it's the ruler that's failing to start:

level=info ts=2020-09-03T01:18:10.009762812Z caller=main.go:162 msg="Starting Cortex" version="(version=1.3.0, branch=HEAD, revision=90cf57435)"
level=info ts=2020-09-03T01:18:10.009938816Z caller=server.go:194 http=[::]:80 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2020-09-03T01:18:10.014732218Z caller=memcached.go:47 msg="created memcached cache"
level=warn ts=2020-09-03T01:18:10.014984347Z caller=experimental.go:19 msg="experimental feature in use" feature="Blocks storage engine"
level=info ts=2020-09-03T01:18:10.015787248Z caller=module_service.go:58 msg=initialising module=server
level=info ts=2020-09-03T01:18:10.015857766Z caller=module_service.go:58 msg=initialising module=store-queryable
level=info ts=2020-09-03T01:18:10.015878462Z caller=module_service.go:58 msg=initialising module=runtime-config
level=info ts=2020-09-03T01:18:10.016453162Z caller=module_service.go:58 msg=initialising module=memberlist-kv
level=info ts=2020-09-03T01:18:10.01692631Z caller=module_service.go:58 msg=initialising module=ring
level=info ts=2020-09-03T01:18:10.017106419Z caller=module_service.go:58 msg=initialising module=distributor
level=info ts=2020-09-03T01:18:10.11931313Z caller=fetcher.go:453 org_id=fake component=block.BaseFetcher msg="successfully synchronized block metadata" duration=101.319391ms cached=178 returned=82 partial=0
level=info ts=2020-09-03T01:18:11.836562804Z caller=fetcher.go:453 org_id=fake component=block.BaseFetcher msg="successfully synchronized block metadata" duration=65.240933ms cached=178 returned=82 partial=0
level=info ts=2020-09-03T01:18:15.53879203Z caller=fetcher.go:453 org_id=fake component=block.BaseFetcher msg="successfully synchronized block metadata" duration=63.360277ms cached=178 returned=82 partial=0
level=error ts=2020-09-03T01:18:15.548938974Z caller=cortex.go:319 msg="module failed" module=store-queryable err="invalid service state: Failed, expected: Running, failure: unable to start blocks storage queryable subservices: not healthy"
level=error ts=2020-09-03T01:18:15.548999735Z caller=cortex.go:319 msg="module failed" module=ruler err="failed to start ruler, because it depends on module store-queryable, which has failed: context canceled"
level=error ts=2020-09-03T01:18:15.549008571Z caller=client.go:198 msg="error while rate-limiting" key=store-gateway err="context canceled"
level=info ts=2020-09-03T01:18:15.549063192Z caller=module_service.go:90 msg="module stopped" module=distributor
level=error ts=2020-09-03T01:18:15.549095769Z caller=client.go:198 msg="error while rate-limiting" key=ring err="context canceled"
level=info ts=2020-09-03T01:18:15.549115661Z caller=module_service.go:90 msg="module stopped" module=ring
level=info ts=2020-09-03T01:18:15.54913489Z caller=module_service.go:90 msg="module stopped" module=runtime-config
level=info ts=2020-09-03T01:18:15.549158497Z caller=module_service.go:90 msg="module stopped" module=memberlist-kv
level=info ts=2020-09-03T01:18:15.549250071Z caller=server_service.go:50 msg="server stopped"
level=info ts=2020-09-03T01:18:15.549264317Z caller=module_service.go:90 msg="module stopped" module=server
level=info ts=2020-09-03T01:18:15.549274603Z caller=cortex.go:308 msg="Cortex stopped"
level=error ts=2020-09-03T01:18:15.549319793Z caller=log.go:143 msg="error running cortex" err="failed services\ngithub.com/cortexproject/cortex/pkg/cortex.(*Cortex).Run\n\t/go/src/github.com/cortexproject/cortex/pkg/cortex/cortex.go:354\nmain.main\n\t/go/src/github.com/cortexproject/cortex/cmd/cortex/main.go:164\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"

We're running cortex v.1.3.0. Here's the ruler's config:

      -auth.enabled=false
      -consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
      -distributor.health-check-ingesters=true
      -distributor.replication-factor=3
      -distributor.shard-by-all-labels=true
      -dynamodb.api-limit=10
      -dynamodb.url=https://us-west-2
      -experimental.blocks-storage.bucket-store.ignore-deletion-marks-delay=1h
      -experimental.blocks-storage.bucket-store.metadata-cache.backend=memcached
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.addresses=dnssrvnoa+memcached-metadata.cortex-tsdb.svc.cluster.local:11211
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.max-async-buffer-size=25000
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.max-async-concurrency=50
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.max-get-multi-batch-size=100
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.max-item-size=1048576
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.timeout=200ms
      -experimental.blocks-storage.bucket-store.sync-dir=/data/tsdb
      -experimental.blocks-storage.s3.bucket-name=robinhood-cortex-dev-tsdb
      -experimental.blocks-storage.s3.endpoint=s3.us-west-2.amazonaws.com
      -experimental.blocks-storage.tsdb.block-ranges-period=2h
      -experimental.blocks-storage.tsdb.dir=/data/tsdb
      -experimental.blocks-storage.tsdb.retention-period=96h
      -experimental.blocks-storage.tsdb.ship-interval=1m
      -experimental.ruler.enable-api=true
      -experimental.store-gateway.replication-factor=3
      -experimental.store-gateway.sharding-enabled=true
      -experimental.store-gateway.sharding-ring.consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
      -experimental.store-gateway.sharding-ring.prefix=
      -experimental.store-gateway.sharding-ring.store=consul
      -limits.per-user-override-config=/etc/cortex/overrides.yaml
      -log.level=debug
      -querier.query-ingesters-within=13h
      -querier.query-store-after=12h
      -ring.heartbeat-timeout=10m
      -ring.prefix=
      -ruler.alertmanager-url=http://alertmanager.cortex-tsdb.svc.cluster.local/alertmanager
      -ruler.enable-sharding=true
      -ruler.ring.consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
      -ruler.rule-path=/etc/cortex/ruler/rules
      -ruler.storage.local.directory=/etc/cortex/ruler
      -ruler.storage.type=local
      -s3.url=https://us-west-2/robinhood-cortex-dev-tsdb
      -schema-config-file=/etc/cortex/schema/config.yaml
      -store.cardinality-limit=1000000
      -store.engine=blocks
      -store.max-query-length=744h
      -target=ruler

One of the diagnoses from the previous issue was an unhealthy consul cluster, which is not a problem we appear to be having since all the other components are healthy.

jpdstan commented 4 years ago

For whoever looks at this issue, here is the same configuration but with log.level: 'debug' https://pastebin.com/NukE3AaQ

jpdstan commented 4 years ago

Update on this: we also tried this out with s3 as a backing store instead of local disk, we got the same error. specifically replacing

      -ruler.storage.local.directory=/etc/cortex/ruler
      -ruler.storage.type=local

with

      -ruler.storage.s3.url=<s3_url>
      -ruler.storage.type=s3
pstibrany commented 4 years ago

Thank you for your reports. You're right that these logs don't contain the error from "blocks storage queryable" subservice, so it's difficult to diagnose what's wrong. I've extended error reporting on this code path in #3125, if you have a chance to give it a try, that would be helpful.

amckinley commented 4 years ago

That's great! Any chance there's going to be a v1.4 release candidate cut with #3125 any time soon? If we need to do a custom build of cortex to test this, we can, but right now we're running the official Dockerhub v1.3.0 release, so we'd have to build our own container.

pstibrany commented 4 years ago

Next release cycle will start in two weeks (week starting 14th of September), with final release likely the week after.

amckinley commented 4 years ago

Yep, this new logging is very helpful. This is obviously our problem to solve, but here's the new error we're seeing:

level=info ts=2020-09-03T20:20:07.547735931Z caller=module_service.go:58 msg=initialising module=ring
level=info ts=2020-09-03T20:20:07.547968583Z caller=module_service.go:58 msg=initialising module=distributor-service
level=debug ts=2020-09-03T20:20:07.548262851Z caller=module_service.go:48 msg="module waiting for initialization" module=ruler waiting_for=store-queryable
level=info ts=2020-09-03T20:20:07.647066619Z caller=fetcher.go:453 org_id=fake component=block.BaseFetcher msg="successfully synchronized block metadata" duration=98.064503ms cached=163 returned=81 partial=0
level=info ts=2020-09-03T20:20:08.661072094Z caller=module_service.go:58 msg=initialising module=ruler
level=info ts=2020-09-03T20:20:08.666964583Z caller=basic_lifecycler.go:242 msg="instance not found in the ring" instance=ruler-6567dd65f7-j8zsh ring=ruler
level=info ts=2020-09-03T20:20:08.668821489Z caller=ruler.go:345 msg="ruler up and running"
level=info ts=2020-09-03T20:20:08.668890952Z caller=cortex.go:315 msg="Cortex started"
level=debug ts=2020-09-03T20:20:08.669503307Z caller=ruler.go:316 msg="rule group not owned, address does not match" owner_addr=10.85.127.207:9095 addr=10.85.107.91:9095
level=debug ts=2020-09-03T20:20:08.669529559Z caller=ruler.go:316 msg="rule group not owned, address does not match" owner_addr=10.85.39.130:9095 addr=10.85.107.91:9095
level=error ts=2020-09-03T20:20:08.669560437Z caller=manager.go:126 msg="unable to map rule files" user=..2020_09_03_20_20_02.632031398 err="mkdir /etc/cortex/ruler/rules: read-only file system"
level=debug ts=2020-09-03T20:20:18.284269479Z caller=logging.go:66 traceID=68de0a0664a661cb msg="GET /ready (200) 61.21µs"
level=debug ts=2020-09-03T20:20:28.284203914Z caller=logging.go:66 traceID=4a584fa842a1e1c0 msg="GET /ready (200) 45.996µs"
level=debug ts=2020-09-03T20:20:38.284184123Z caller=logging.go:66 traceID=536be7035ca8de51 msg="GET /ready (200) 36.714µs"
level=debug ts=2020-09-03T20:20:48.284152674Z caller=logging.go:66 traceID=70740d96899a43c1 msg="GET /ready (200) 38.596µs"
level=debug ts=2020-09-03T20:20:58.284263333Z caller=logging.go:66 traceID=7963f60e2a1a2619 msg="GET /ready (200) 38.025µs"
level=debug ts=2020-09-03T20:21:08.284177299Z caller=logging.go:66 traceID=574cf0c624415978 msg="GET /ready (200) 43.811µs"
level=debug ts=2020-09-03T20:21:08.669663882Z caller=ruler.go:316 msg="rule group not owned, address does not match" owner_addr=10.85.60.132:9095 addr=10.85.107.91:9095
level=debug ts=2020-09-03T20:21:08.669698516Z caller=ruler.go:316 msg="rule group not owned, address does not match" owner_addr=10.85.60.132:9095 addr=10.85.107.91:9095
level=error ts=2020-09-03T20:21:08.669724514Z caller=manager.go:126 msg="unable to map rule files" user=..2020_09_03_20_20_02.632031398 err="mkdir /etc/cortex/ruler/rules: read-only file system"
jpdstan commented 4 years ago

This should be safe to close -- we fixed the error above by removing the -ruler.rule-path=/etc/cortex/ruler/rules argument.

jpdstan commented 4 years ago

For posterity, the unable to start blocks storage queryable subservices: not healthy error turned out to be a permissions one for us, in which we didn't provide the correct IAM role with proper s3 read/write permissions. Though, this kind of begs the question of why we need s3 access in the first place if we're using local as storage.

pstibrany commented 4 years ago

For posterity, the unable to start blocks storage queryable subservices: not healthy error turned out to be a permissions one for us, in which we didn't provide the correct IAM role with proper s3 read/write permissions.

Thanks for update!

Though, this kind of begs the question of why we need s3 access in the first place if we're using local as storage.

“blocks storage queryable” is a component responsible for querying data from long-term storage. Ruler uses it (because Ruler executes queries locally, not via querier), and this component needs to be able to read blocks from long-term storage.