cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/

Ruler failing to poll for rule groups #3136

Closed: jpdstan closed this issue 4 years ago

jpdstan commented 4 years ago

On process startup, the ruler immediately fails to poll for rule groups. We get the following error:

level=info ts=2020-09-04T06:30:13.069468201Z caller=cortex.go:307 msg="Cortex started"
level=error ts=2020-09-04T06:30:13.069488601Z caller=ruler.go:438 msg="unable to poll for rules" err="failed to list rule groups for user fake: failed to list rule group for user fake and namespace ..data: error parsing /etc/cortex/ruler/fake/..data: /etc/cortex/ruler/fake/..data: read /etc/cortex/ruler/fake/..data: is a directory"

Looking at that directory inside the container, I do see the rule groups that we added:

$ ls -al /etc/cortex/ruler/fake
total 12
drwxrwxrwx    3 root     root          4096 Sep  4 06:30 .
drwxr-xr-x    3 root     root          4096 Sep  4 06:30 ..
drwxr-xr-x    2 root     root          4096 Sep  4 06:30 ..2020_09_04_06_30_06.031658285
lrwxrwxrwx    1 root     root            31 Sep  4 06:30 ..data -> ..2020_09_04_06_30_06.031658285
lrwxrwxrwx    1 root     root            17 Sep  4 06:30 common.yml -> ..data/common.yml
lrwxrwxrwx    1 root     root            16 Sep  4 06:30 nginx.yml -> ..data/nginx.yml

It looks like the poller skips directories, but still tries to follow symbolic links that point to directories. (For context, the layout above is how the kubelet mounts a ConfigMap: the timestamped directory holds the real files, the ..data symlink is swapped atomically on updates, and the visible .yml files are symlinks through ..data.) The poller needs to be a bit smarter: it should still follow links to regular files, but not links to directories.
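
For what it's worth, here is a minimal sketch in plain Go (not the actual Cortex code, just an illustration of the check) of a listing that follows symlinks to files but skips anything that resolves to a directory, such as ..data above:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// listRuleFiles returns the rule files in dir, following symlinks to
// regular files (e.g. common.yml -> ..data/common.yml) while skipping
// directories and symlinks that resolve to directories (e.g. ..data).
func listRuleFiles(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir) // Go 1.16+; ioutil.ReadDir on older Go
	if err != nil {
		return nil, err
	}
	var files []string
	for _, e := range entries {
		path := filepath.Join(dir, e.Name())
		// os.Stat follows symlinks, so a link like "..data" that points
		// at a directory reports IsDir() == true here, unlike e.IsDir(),
		// which only inspects the link itself.
		info, err := os.Stat(path)
		if err != nil {
			return nil, err
		}
		if info.IsDir() {
			continue
		}
		files = append(files, path)
	}
	return files, nil
}

func main() {
	files, err := listRuleFiles("/etc/cortex/ruler/fake")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, f := range files {
		fmt.Println(f)
	}
}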

Our current configuration for reference:

Containers:
   ruler:
    Image:       cortexproject/cortex:v1.3.0
    Ports:       80/TCP, 9095/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      -auth.enabled=false
      -consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
      -distributor.health-check-ingesters=true
      -distributor.replication-factor=3
      -distributor.shard-by-all-labels=true
      -dynamodb.api-limit=10
      -dynamodb.url=https://us-west-2
      -experimental.blocks-storage.bucket-store.ignore-deletion-marks-delay=1h
      -experimental.blocks-storage.bucket-store.metadata-cache.backend=memcached
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.addresses=dnssrvnoa+memcached-metadata.cortex-tsdb.svc.cluster.local:11211
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.max-async-buffer-size=25000
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.max-async-concurrency=50
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.max-get-multi-batch-size=100
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.max-item-size=1048576
      -experimental.blocks-storage.bucket-store.metadata-cache.memcached.timeout=200ms
      -experimental.blocks-storage.bucket-store.sync-dir=/data/tsdb
      -experimental.blocks-storage.s3.bucket-name=robinhood-cortex-dev-tsdb
      -experimental.blocks-storage.s3.endpoint=s3.us-west-2.amazonaws.com
      -experimental.blocks-storage.tsdb.block-ranges-period=2h
      -experimental.blocks-storage.tsdb.dir=/data/tsdb
      -experimental.blocks-storage.tsdb.retention-period=96h
      -experimental.blocks-storage.tsdb.ship-interval=1m
      -experimental.ruler.enable-api=true
      -experimental.store-gateway.replication-factor=3
      -experimental.store-gateway.sharding-enabled=true
      -experimental.store-gateway.sharding-ring.consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
      -experimental.store-gateway.sharding-ring.prefix=
      -experimental.store-gateway.sharding-ring.store=consul
      -limits.per-user-override-config=/etc/cortex/overrides.yaml
      -querier.query-ingesters-within=13h
      -querier.query-store-after=12h
      -ring.heartbeat-timeout=10m
      -ring.prefix=
      -ruler.alertmanager-url=http://alertmanager.cortex-tsdb.svc.cluster.local/alertmanager
      -ruler.enable-sharding=true
      -ruler.ring.consul.hostname=consul.cortex-tsdb.svc.cluster.local:8500
      -ruler.storage.local.directory=/etc/cortex/ruler
      -ruler.storage.type=local
      -s3.url=https://us-west-2/robinhood-cortex-dev-tsdb
      -schema-config-file=/etc/cortex/schema/config.yaml
      -store.cardinality-limit=1000000
      -store.engine=blocks
      -store.max-query-length=744h
      -target=ruler
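
For anyone following along: the files under -ruler.storage.local.directory are plain Prometheus rule-group YAML, one namespace per file inside a per-tenant directory (the tenant is "fake" since we run with -auth.enabled=false). A hypothetical nginx.yml, for illustration only:

groups:
  - name: nginx
    rules:
      - record: job:nginx_http_requests:rate5m
        expr: sum by (job) (rate(nginx_http_requests_total[5m]))
      - alert: NginxDown
        expr: up{job="nginx"} == 0
        for: 5m
        labels:
          severity: page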

cc @pstibrany

gotjosh commented 4 years ago

Thanks for reporting this! The ruler local store is very basic at the moment; it is mostly used for "quick" testing and is nowhere near production-ready.

That being said, I think this is worth fixing.

gotjosh commented 4 years ago

Looks like @pstibrany is already on it 👍

amckinley commented 4 years ago

and is nowhere near production-ready.

Do you actually think we're going to run into issues with this? Our plan was to migrate our current (extensive) Prometheus alerting and recording rules to a k8s ConfigMap mounted locally, so we could continue to keep everything in git. I can see how the ruler APIs are cool, but is there any tooling right now to do things like sync a directory of rule files to the API?

gotjosh commented 4 years ago

Our plan was to migrate our current (extensive) Prometheus alerting and recording rules to a k8s ConfigMap mounted locally, so we could continue to keep everything in git

Off the top of my head, I don't think you'll encounter any issues going down that route. It's mostly about the lack of feature parity with the config API, and the fact that there's really no benefit to using the local store (e.g. as described in #3134, the ruler still writes the rules to disk when polling, so in this case they end up on disk twice).
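
For example, with -experimental.ruler.enable-api=true the ruler exposes a config API you can POST rule groups to directly. Something along these lines (paths and headers from memory, so double-check against the docs):

$ cat group.yml    # a single rule group, not a full rules file
name: nginx
rules:
  - record: job:nginx_http_requests:rate5m
    expr: sum by (job) (rate(nginx_http_requests_total[5m]))

$ curl -X POST \
    -H "X-Scope-OrgID: fake" \
    -H "Content-Type: application/yaml" \
    --data-binary @group.yml \
    http://<ruler-address>/api/v1/rules/<namespace>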

but is there any tooling right now to do things like sync a directory of rule files to the API?

We have some tooling in place to do this in the form of https://github.com/grafana/cortex-tools/ and https://github.com/grafana/cortex-rules-action but it might be a bit too specific depending on the way you do authentication. That being said, it should be fairly trivial to make the authentication bits optional within the CLI should there be a need for it.
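
For a directory of rule files, the invocation would look something like this (flags from memory, see the cortex-tools README for the exact usage):

$ cortextool rules sync \
    --address=http://<cortex-address> \
    --id=fake \
    rules/*.yml

The sync subcommand diffs the local files against what the cluster has and creates, updates, or deletes rule groups to match.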