canonical / mimir-worker-k8s-operator

This charmed operator is part of automating the operational procedures of running Grafana Mimir, an open-source metrics backend, in microservices mode.
https://charmhub.io/mimir-worker-k8s
Apache License 2.0

Workers not working if S3 configuration is invalid but in `active` status #34

Open Abuelodelanada opened 3 months ago

Abuelodelanada commented 3 months ago

Bug Description

A misconfiguration in s3-integrator stops the Mimir worker from working, yet the worker still reports `active` status.

Expected behavior: the Mimir worker should go into `blocked` status with a meaningful message.
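A minimal sketch of the kind of check that could drive that status, not the charm's actual code. It assumes the only failure being detected is an unresolvable endpoint host (as in the log output in this report). The function name `s3_status` and the injectable `resolve` parameter are hypothetical, and the returned strings merely mirror Juju's `active` / `blocked` status names.

```python
import socket
from typing import Callable, Tuple
from urllib.parse import urlparse


def s3_status(
    endpoint: str,
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> Tuple[str, str]:
    """Return a (status, message) pair for the given S3 endpoint.

    Hypothetical helper: only checks that the endpoint host resolves.
    """
    # Accept both "host" and "https://host:port" forms.
    parsed = urlparse(endpoint if "://" in endpoint else f"https://{endpoint}")
    host = parsed.hostname
    if not host:
        return "blocked", "S3 endpoint is empty or malformed"
    try:
        resolve(host)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return "blocked", f"S3 endpoint '{host}' is not resolvable: {exc}"
    return "active", ""
```

With the value from step 5 (`endpoint="endpoint"`), a resolver that fails would yield `blocked` plus a message naming the host, which is the outcome this report asks for. A real check would probably also need to verify credentials and bucket access.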

To Reproduce

  1. Deploy coordinator: juju deploy ./*.charm coord --resource nginx-image=ubuntu/nginx:1.18-22.04_beta --resource nginx-prometheus-exporter-image=nginx/nginx-prometheus-exporter:1.1.0 --trust

  2. Deploy worker: juju deploy ./*.charm mimir --resource mimir-image=ubuntu/mimir:2.10.0-22.04 --trust --config all=True

  3. Deploy s3-integrator: juju deploy s3-integrator --channel edge --trust

  4. Deploy prometheus: juju deploy prometheus-k8s prom --channel edge --trust

  5. Configure s3-integrator (with a deliberately invalid endpoint): juju run s3-integrator/leader sync-s3-credentials access-key=AccessKey secret-key=SecretKey bucket="mimir" endpoint="endpoint"

  6. Relate coord to mimir: juju relate coord mimir

  7. Relate coord to prometheus: juju relate prom:metrics-endpoint coord:self-metrics-endpoint

  8. Check in prometheus that the scrape job is scrapeable:

    image

  9. Relate coord to s3-integrator: juju relate coord s3-integrator

  10. Check in prometheus that the scrape job is NOT scrapeable:

    image

Environment

Model  Controller  Cloud/Region        Version  SLA          Timestamp
mimir  microk8s    microk8s/localhost  3.4.0    unsupported  08:52:20-03:00

App            Version  Status  Scale  Charm                  Channel  Rev  Address         Exposed  Message
coord                   active      1  mimir-coordinator-k8s            37  10.152.183.60   no       
mimir          2.10.0   active      1  mimir-worker-k8s                  4  10.152.183.133  no       
prom           2.50.1   active      1  prometheus-k8s         edge     173  10.152.183.101  no       
s3-integrator           active      1  s3-integrator          edge      17  10.152.183.171  no       

Unit              Workload  Agent  Address       Ports  Message
coord/0*          active    idle   10.1.200.104         
mimir/0*          active    idle   10.1.200.96          
prom/0*           active    idle   10.1.200.91          
s3-integrator/0*  active    idle   10.1.200.108         

Integration provider               Requirer                           Interface            Type     Message
coord:mimir-cluster                mimir:mimir-cluster                mimir_cluster        regular  
coord:self-metrics-endpoint        prom:metrics-endpoint              prometheus_scrape    regular  
prom:prometheus-peers              prom:prometheus-peers              prometheus_peers     peer     
s3-integrator:s3-credentials       coord:s3                           s3                   regular  
s3-integrator:s3-integrator-peers  s3-integrator:s3-integrator-peers  s3-integrator-peers  peer  

Relevant log output

ts=2024-04-11T11:49:53.788253976Z caller=sanity_check.go:39 level=info msg="Checking object storage config"
ts=2024-04-11T11:49:55.813866533Z caller=seed.go:127 level=warn msg="failed to read cluster seed file from object storage" err="Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:49:55.814834682Z caller=sanity_check.go:115 level=warn msg="Unable to successfully connect to configured object storage (will retry)" err="2 errors: blocks storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host; ruler storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:49:57.161752961Z caller=sanity_check.go:115 level=warn msg="Unable to successfully connect to configured object storage (will retry)" err="2 errors: blocks storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host; ruler storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:49:57.591882368Z caller=seed.go:127 level=warn msg="failed to read cluster seed file from object storage" err="Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:49:59.940176817Z caller=sanity_check.go:115 level=warn msg="Unable to successfully connect to configured object storage (will retry)" err="2 errors: blocks storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host; ruler storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:50:01.2347005Z caller=seed.go:127 level=warn msg="failed to read cluster seed file from object storage" err="Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:50:04.403887627Z caller=sanity_check.go:115 level=warn msg="Unable to successfully connect to configured object storage (will retry)" err="2 errors: blocks storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host; ruler storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:50:07.125160902Z caller=seed.go:127 level=warn msg="failed to read cluster seed file from object storage" err="Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:50:08.474983073Z caller=sanity_check.go:115 level=warn msg="Unable to successfully connect to configured object storage (will retry)" err="2 errors: blocks storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host; ruler storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:50:12.652371086Z caller=sanity_check.go:115 level=warn msg="Unable to successfully connect to configured object storage (will retry)" err="2 errors: blocks storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host; ruler storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:50:17.319101362Z caller=sanity_check.go:115 level=warn msg="Unable to successfully connect to configured object storage (will retry)" err="2 errors: blocks storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host; ruler storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:50:19.67827222Z caller=seed.go:127 level=warn msg="failed to read cluster seed file from object storage" err="Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
ts=2024-04-11T11:50:21.970860115Z caller=sanity_check.go:115 level=warn msg="Unable to successfully connect to configured object storage (will retry)" err="2 errors: blocks storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host; ruler storage: unable to successfully send a request to object storage: Get \"https://endpoint/mimir/?location=\": dial tcp: lookup endpoint on 10.152.183.10:53: no such host"
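
The warnings above are in logfmt form, so the failure can be detected mechanically. A small sketch, assuming the message text stays exactly as logged here; `parse_logfmt` and `storage_unreachable` are hypothetical helper names, not part of Mimir or the charm.

```python
import re
from typing import Dict, Iterable

# key=value pairs where the value is either a quoted string
# (with backslash escapes) or a run of non-space characters.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

STORAGE_WARNING = "Unable to successfully connect to configured object storage"


def parse_logfmt(line: str) -> Dict[str, str]:
    """Split one logfmt line into a dict of fields."""
    fields: Dict[str, str] = {}
    for key, value in _PAIR.findall(line):
        if value.startswith('"'):
            # Drop the surrounding quotes and unescape inner quotes.
            value = value[1:-1].replace('\\"', '"')
        fields[key] = value
    return fields


def storage_unreachable(lines: Iterable[str]) -> bool:
    """Return True if any line is the object storage warning."""
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") == "warn" and fields.get("msg", "").startswith(
            STORAGE_WARNING
        ):
            return True
    return False
```

Fed the output above, `storage_unreachable` would return True from the second log line onward, while the initial "Checking object storage config" line alone would not trigger it.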

Additional context

bundle: kubernetes
applications:
  coord:
    charm: local:mimir-coordinator-k8s-37
    scale: 1
    constraints: arch=amd64
    trust: true
  mimir:
    charm: local:mimir-worker-k8s-4
    scale: 1
    options:
      all: true
    constraints: arch=amd64
    storage:
      data: kubernetes,1,1024M
      recovery-data: kubernetes,1,1024M
    trust: true
  prom:
    charm: prometheus-k8s
    channel: edge
    revision: 173
    base: ubuntu@20.04/stable
    resources:
      prometheus-image: 141
    scale: 1
    constraints: arch=amd64
    storage:
      database: kubernetes,1,1024M
    trust: true
  s3-integrator:
    charm: s3-integrator
    channel: edge
    revision: 17
    scale: 1
    constraints: arch=amd64
    trust: true
relations:
- - prom:metrics-endpoint
  - coord:self-metrics-endpoint
- - coord:mimir-cluster
  - mimir:mimir-cluster
- - coord:s3
  - s3-integrator:s3-credentials

lucabello commented 3 months ago

We should probably solve this by adding a Loki alert rule to make sure S3 is working.
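
One possible shape for such a rule, sketched as a Python helper that builds the rule group and prints it as JSON (which is also valid YAML). Everything here is an assumption for illustration: the `juju_application` stream label, the alert and group names, the `critical` severity, and the 5-minute window. Only the matched message text comes from the log output in this issue.

```python
import json
from typing import Dict


def s3_alert_rule(app: str, window: str = "5m") -> Dict:
    """Build a hypothetical Loki alert rule group for the storage warning."""
    message = "Unable to successfully connect to configured object storage"
    # LogQL: count matching log lines in the window; fire if there are any.
    expr = (
        f'count_over_time({{juju_application="{app}"}} '
        f'|= "{message}" [{window}]) > 0'
    )
    return {
        "groups": [
            {
                "name": "mimir_object_storage",  # assumed group name
                "rules": [
                    {
                        "alert": "MimirObjectStorageUnreachable",  # assumed name
                        "expr": expr,
                        "for": window,
                        "labels": {"severity": "critical"},
                        "annotations": {
                            "summary": f"{app} cannot reach its configured object storage"
                        },
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(s3_alert_rule("mimir"), indent=2))
```

An alert of this kind would surface the problem in alerting, but it would not by itself change the unit's `active` status.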

ca-scribner commented 3 weeks ago

Some notes from breakdown meeting: