cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Ruler unable to list rules when s3 bucket uses percentage encoding #4722

Open PatrikMuniak opened 2 years ago

PatrikMuniak commented 2 years ago

Describe the bug We are setting up Cortex on premises with an S3-compatible object store, Hitachi Content Platform (HCP). The Cortex ruler fails to read rules from the HCP bucket. When Cortex lists the rule groups, the object keys come back with the = padding percent-encoded as %3D (e.g. the object stored in the bucket as bG9raS1ub2Rlcy1ydWxlcw== is returned as bG9raS1ub2Rlcy1ydWxlcw%3D%3D), so base64 decoding of the key fails while listing rule groups.

https://github.com/cortexproject/cortex/blob/347aacd2c836d5842db8ec972b40a26345b41d82/pkg/ruler/rulestore/bucketclient/bucket_client.go#L300

To reproduce the issue in code, I wrote this test:

package main

import (
    "encoding/base64"
    "fmt"
)

func main() {
    // Key as returned by the bucket listing: the "==" padding arrives as "%3D%3D".
    decodedNamespace, err := base64.URLEncoding.DecodeString("bG9raS1ub2Rlcy1ydWxlcw%3D%3D")
    // Round trip of the same namespace without percent-encoding, for comparison.
    encoded := base64.URLEncoding.EncodeToString([]byte("loki-nodes-rules"))
    decoded, err2 := base64.URLEncoding.DecodeString(encoded)
    fmt.Println(string(decodedNamespace), err)
    fmt.Println(string(decoded), err2)
}

loki-nodes-rule illegal base64 data at input byte 22
loki-nodes-rules <nil>
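
As an illustration of one possible workaround (a minimal sketch, not the Cortex implementation), the key segment could be percent-unescaped before it is base64-decoded. The helper name `decodeKeySegment` is hypothetical. Since the base64 URL alphabet never contains `%`, unescaping an already clean segment leaves it unchanged.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/url"
)

// decodeKeySegment is a hypothetical helper, not part of Cortex.
// It first undoes percent-encoding (e.g. "%3D" -> "=") and then
// base64-decodes the result with the URL-safe alphabet.
func decodeKeySegment(segment string) (string, error) {
	unescaped, err := url.PathUnescape(segment)
	if err != nil {
		return "", fmt.Errorf("unescape %q: %w", segment, err)
	}
	decoded, err := base64.URLEncoding.DecodeString(unescaped)
	if err != nil {
		return "", fmt.Errorf("base64-decode %q: %w", unescaped, err)
	}
	return string(decoded), nil
}

func main() {
	// Both the percent-encoded and the plain form of the key are accepted.
	for _, s := range []string{
		"bG9raS1ub2Rlcy1ydWxlcw%3D%3D",
		"bG9raS1ub2Rlcy1ydWxlcw==",
	} {
		name, err := decodeKeySegment(s)
		fmt.Println(name, err)
	}
}
```

Whether normalizing keys on the Cortex side is the right fix, rather than changing how the listing is requested from the bucket, is an open question in this issue.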

To Reproduce Steps to reproduce the behavior:

  1. Start Cortex 1.11.1 with single-process-config-blocks.yaml
  2. Set up an HCP bucket for the ruler
  3. Upload a sample rule: ./cortextool rules load ~/notes/paas/cortex-rules-alerts/ruler/loki-nodes-rules.yaml --address=http://<url>:9008 --id=nap-tom
  4. Check the logs for errors coming from bucket_client.go (see the log I received below)

config.yaml

# Configuration for running Cortex in single-process mode.
# This should not be used in production.  It is only for getting started
# and development.

# Disable the requirement that every request to Cortex has a
# X-Scope-OrgID header. `fake` will be substituted in instead.
auth_enabled: false

server:
  http_listen_port: 9008
  grpc_listen_port: 9099
  log_level: debug
  # Configure the server to allow messages up to 100MB.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true

ingester_client:
  grpc_client_config:
    # Configure the client to allow messages up to 100MB.
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip

ingester:
  lifecycler:
    # The address to advertise for this ingester.  Will be autodiscovered by
    # looking up address on eth0 or en0; can be specified if this fails.
    # address: 127.0.0.1
    interface_names: [ens160] 
    # We want to start immediately and flush on shutdown.
    join_after: 0
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512

    # Use an in memory ring store, so we don't need to launch a Consul.
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

storage:
  engine: blocks

blocks_storage:
  tsdb:
    dir: /tmp/cortex/tsdb

  bucket_store:
    sync_dir: /tmp/cortex/tsdb-sync

  # You can choose between local storage and Amazon S3, Google GCS and Azure storage. Each option requires additional configuration
  # as shown below. All options can be configured via flags as well which might be handy for secret inputs.
  backend: s3 # s3, gcs, azure or filesystem are valid options
  s3:
    bucket_name: eu-cortex-metrics
    endpoint: url
    access_key_id: "user"
    secret_access_key: "password"
    #insecure: true
    #signature_version: "v2"
    http:
      insecure_skip_verify: true

compactor:
  data_dir: /tmp/cortex/compactor
  sharding_ring:
    kvstore:
      store: inmemory

frontend_worker:
  match_max_concurrent: true

ruler:
  enable_api: true
  enable_sharding: false
  rule_path: /tmp/cortex/tmp-rules

ruler_storage:
  backend: s3
  local:
    directory: /tmp/cortex/rules
  s3:
    bucket_name: eu-cortex-ruler
    endpoint: url
    access_key_id: "user"
    secret_access_key: "password"
    #insecure: true
    #signature_version: "v2"
    http:
      insecure_skip_verify: true

loki-nodes-rules.yaml

groups:
  - name: loki-nodes
    rules:
    - alert: loki-up
      expr: up{application="loki"} == 1
      labels:
            severity: MAJOR
      annotations:
            description: "Loki is not running on {{ $labels.hostname }}"

These are the logs I received:

level=warn ts=2022-04-15T16:21:04.619726789Z caller=bucket_client.go:147 msg="invalid rule group object key found while listing rule groups" user=fake key=bG9raS1ub2Rlcy1ydWxlcw%3D%3D/ err="illegal base64 data at input byte 22"

Expected behavior The ruler lists the rules without any errors.

alanprot commented 2 years ago

What is this bG9raS1ub2Rlcy1ydWxlcw== object?

PatrikMuniak commented 2 years ago

What is this bG9raS1ub2Rlcy1ydWxlcw== object?

@alanprot That is the namespace encoded in base64. It corresponds to the filename of the rule file I was trying to upload to Cortex. In the bucket, it is a folder that contains the rule group.

alanprot commented 2 years ago

Oh, OK.

So for some reason Hitachi Content Platform is percent-encoding the keys in the response?

bG9raS1ub2Rlcy1ydWxlcw== becomes bG9raS1ub2Rlcy1ydWxlcw%3D%3D

So I guess the question is: why is Hitachi encoding the response?

PatrikMuniak commented 2 years ago

@alanprot I checked whether the issue persists when the s3 config is defined inside the ruler: block, and there it seems to work. Example:

auth_enabled: true

server:
  http_listen_port: 9008
  grpc_listen_port: 9099
  log_level: debug

  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip

ingester:
  lifecycler:
    interface_names: [ens160] 
    join_after: 0
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

storage:
  engine: blocks

blocks_storage:
  tsdb:
    dir: /tmp/cortex/tsdb

  bucket_store:
    sync_dir: /tmp/cortex/tsdb-sync

  backend: s3
  s3:
    bucket_name: eu-cortex-metrics
    endpoint: <endpoint>
    access_key_id: "<id>"
    secret_access_key: "<secret>"

    http:
      insecure_skip_verify: true

compactor:
  data_dir: /tmp/cortex/compactor
  sharding_ring:
    kvstore:
      store: inmemory

frontend_worker:
  match_max_concurrent: true

ruler:
  enable_api: true
  enable_sharding: false
  rule_path: /tmp/cortex/tmp-rules

  storage:
    type: s3
    s3:

      bucketnames: eu-cortex-ruler
      endpoint: <endpoint>
      access_key_id: "<id>"
      secret_access_key: "<secret>"
      http_config:
        insecure_skip_verify: true

I upload the rule with the same cortextool command and it doesn't give me errors:

level=debug ts=2022-04-27T09:16:00.690149205Z caller=rule_store.go:147 msg="loading rule group" key="rules/nap-tom/bG9raS1ub2Rlcy1ydWxlcw==/bG9raS1ub2Rlcw==" user=nap-tom

If I switch to configuring the s3 bucket in the ruler_storage: block, for example:

auth_enabled: true

server:
  http_listen_port: 9008
  grpc_listen_port: 9099
  log_level: debug
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000

distributor:
  shard_by_all_labels: true
  pool:
    health_check_ingesters: true

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: gzip

ingester:
  lifecycler:
    interface_names: [ens160] 
    join_after: 0
    min_ready_duration: 0s
    final_sleep: 0s
    num_tokens: 512

    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

storage:
  engine: blocks

blocks_storage:
  tsdb:
    dir: /tmp/cortex/tsdb

  bucket_store:
    sync_dir: /tmp/cortex/tsdb-sync

  backend: s3
  s3:
    bucket_name: eu-cortex-metrics
    endpoint: <endpoint>
    access_key_id: "<id>"
    secret_access_key: "<secret>"
    http:
      insecure_skip_verify: true

compactor:
  data_dir: /tmp/cortex/compactor
  sharding_ring:
    kvstore:
      store: inmemory

frontend_worker:
  match_max_concurrent: true

ruler:
  enable_api: true
  enable_sharding: false
  rule_path: /tmp/cortex/tmp-rules

ruler_storage:
  backend: s3
  local:
    directory: /tmp/cortex/rules
  s3:
    bucket_name: eu-cortex-ruler
    endpoint: <endpoint>
    access_key_id: "<id>"
    secret_access_key: "<secret>"
    http:
      insecure_skip_verify: true

These are the logs I see:

level=warn ts=2022-04-27T09:33:00.421710256Z caller=bucket_client.go:110 msg="invalid rule group object key found while listing rule groups" key=nap-tom/ err="invalid rule group object key"
level=warn ts=2022-04-27T09:33:00.421725842Z caller=bucket_client.go:110 msg="invalid rule group object key found while listing rule groups" key=nap-tom/bG9raS1ub2Rlcy1ydWxlcw%3D%3D/ err="illegal base64 data at input byte 22"
level=warn ts=2022-04-27T09:33:00.421735648Z caller=bucket_client.go:110 msg="invalid rule group object key found while listing rule groups" key=nap-tom/bG9raS1ub2Rlcy1ydWxlcw%3D%3D/bG9raS1ub2Rlcw%3D%3D err="illegal base64 data at input byte 22"

That looks like a Cortex issue.
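
The three warnings above can be reproduced with a simplified key parser. This is only a sketch under assumptions: `parseKey` is a hypothetical stand-in, not the code in bucket_client.go, and it assumes keys of the form `<user>/<base64 namespace>/<base64 group>`. It decodes the namespace before the group, which is why the third key also fails at input byte 22.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

var errInvalidKey = errors.New("invalid rule group object key")

// parseKey is a hypothetical, simplified parser for keys shaped like
// "<user>/<base64 namespace>/<base64 group>". It is not the Cortex code.
func parseKey(key string) (string, string, string, error) {
	parts := strings.Split(key, "/")
	if len(parts) != 3 {
		return "", "", "", errInvalidKey
	}
	// The namespace is decoded first, so a bad namespace masks a bad group.
	namespace, err := base64.URLEncoding.DecodeString(parts[1])
	if err != nil {
		return "", "", "", err
	}
	group, err := base64.URLEncoding.DecodeString(parts[2])
	if err != nil {
		return "", "", "", err
	}
	if parts[0] == "" || len(namespace) == 0 || len(group) == 0 {
		return "", "", "", errInvalidKey
	}
	return parts[0], string(namespace), string(group), nil
}

func main() {
	keys := []string{
		"nap-tom/",
		"nap-tom/bG9raS1ub2Rlcy1ydWxlcw%3D%3D/",
		"nap-tom/bG9raS1ub2Rlcy1ydWxlcw%3D%3D/bG9raS1ub2Rlcw%3D%3D",
		"nap-tom/bG9raS1ub2Rlcy1ydWxlcw==/bG9raS1ub2Rlcw==", // key without percent-encoding
	}
	for _, k := range keys {
		user, namespace, group, err := parseKey(k)
		fmt.Printf("key=%s user=%q namespace=%q group=%q err=%v\n", k, user, namespace, group, err)
	}
}
```

Only the last key, which carries the literal = padding, parses cleanly in this sketch.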

alanprot commented 2 years ago

Hm, interesting.

In the first case, Cortex uses the AWS SDK to call S3:

https://github.com/cortexproject/cortex/blob/2177ec0c9eb6b1ceb7d8808d97945e6557055bb8/pkg/ruler/storage.go#L102
https://github.com/cortexproject/cortex/blob/2177ec0c9eb6b1ceb7d8808d97945e6557055bb8/pkg/chunk/aws/s3_storage_client.go#L382

And in the second case, we are using minio:

https://github.com/cortexproject/cortex/blob/2177ec0c9eb6b1ceb7d8808d97945e6557055bb8/pkg/ruler/storage.go#L119
https://github.com/cortexproject/cortex/blob/2177ec0c9eb6b1ceb7d8808d97945e6557055bb8/vendor/github.com/thanos-io/thanos/pkg/objstore/s3/s3.go#L247

I wonder if this explains the difference in behaviour here.

alvaropalmeirao commented 10 months ago

Hi all,

I'm struggling with the upload of the YAML file to s3. What is the command that you use to upload the rules to s3? Thanks

alvaropalmeirao commented 10 months ago

I found the way to do it: cortextool rules sync --backend=loki --id=fake --rule-files=test1.yml --address=https://<LOKI_ADDRESS>