canonical / prometheus-k8s-operator

This charmed operator automates the operational procedures of running Prometheus, an open-source metrics backend.
https://charmhub.io/prometheus-k8s
Apache License 2.0

Non-functional deployment with latest/edge - /cos-prometheus-0: 404 Not Found #547

Closed nobuto-m closed 9 months ago

nobuto-m commented 10 months ago

Bug Description

When prometheus-k8s is deployed from the latest/edge channel, its ingress endpoint returns "404 page not found". grafana-agent then reports the following error as a remote write failure.

Nov 07 08:37:12 witty-turtle grafana-agent.grafana-agent[71124]: 
ts=2023-11-07T08:37:12.712451803Z caller=dedupe.go:112 agent=prometheus 
instance=2d541e27df39956fc7e2cbd9f44fa4c3 component=remote level=error 
remote_name=2d541e-253980 
url=http://192.168.151.81:80/cos-prometheus-0/api/v1/write 
msg="non-recoverable error while sending metadata" count=446 
err="server returned HTTP status 404 Not Found: 404 page not found"

To Reproduce

  1. prepare microk8s for COS
  2. deploy cos-lite with prometheus-k8s edge
$ cat ./overlay-prometheus-edge.yaml
bundle: kubernetes
applications:
  prometheus:
    charm: prometheus-k8s
    channel: latest/edge
    scale: 1
    trust: true
$ juju deploy cos-lite --trust --overlay ./overlay-prometheus-edge.yaml
Located bundle "cos-lite" in charm-hub, revision 11
Located charm "alertmanager-k8s" in charm-hub, channel stable
Located charm "catalogue-k8s" in charm-hub, channel stable
Located charm "grafana-k8s" in charm-hub, channel stable
Located charm "loki-k8s" in charm-hub, channel stable
Located charm "prometheus-k8s" in charm-hub, channel latest/edge
Located charm "traefik-k8s" in charm-hub, channel stable

...
$ juju show-unit catalogue/0 --format json | jq -r '."catalogue/0"."relation-info"[]."application-data".url' 
null
http://192.168.151.81/cos-grafana
http://192.168.151.81:80/cos-prometheus-0
http://192.168.151.81:80/cos-alertmanager
$ curl -sv http://192.168.151.81:80/cos-prometheus-0
*   Trying 192.168.151.81:80...
* Connected to 192.168.151.81 (192.168.151.81) port 80 (#0)
> GET /cos-prometheus-0 HTTP/1.1
> Host: 192.168.151.81
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 07 Nov 2023 09:02:47 GMT
< Content-Length: 19
< 
404 page not found
* Connection #0 to host 192.168.151.81 left intact
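
The manual curl check above can be repeated for every URL that catalogue receives over its relations. This is a minimal sketch, assuming curl is installed; `http_status` and `probe_urls` are hypothetical helper names used for illustration and are not part of Juju or the charms.

```shell
# http_status URL: print only the HTTP status code of a GET request.
# curl prints 000 when the connection itself fails; "|| true" keeps going.
http_status() {
    curl -s -o /dev/null -m 5 -w '%{http_code}' "$1" || true
}

# probe_urls: read one URL per line on stdin, print "<status> <url>".
probe_urls() {
    while IFS= read -r url; do
        # Skip blank lines and the literal "null" jq prints for missing keys.
        case "$url" in ''|null) continue ;; esac
        printf '%s %s\n' "$(http_status "$url")" "$url"
    done
}

# Usage (same jq filter as in the report):
#   juju show-unit catalogue/0 --format json \
#     | jq -r '."catalogue/0"."relation-info"[]."application-data".url' \
#     | probe_urls
```

In the deployment shown here, the line for the cos-prometheus-0 URL would be expected to carry the 404 status seen in the curl trace.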

Environment

$ juju status
Model  Controller       Cloud/Region            Version  SLA          Timestamp
cos    maas-controller  cos-microk8s/localhost  3.1.6    unsupported  09:01:03Z

App           Version  Status  Scale  Charm             Channel      Rev  Address         Exposed  Message
alertmanager  0.25.0   active      1  alertmanager-k8s  stable        77  10.152.183.203  no       
catalogue              active      1  catalogue-k8s     stable        19  10.152.183.184  no       
grafana       9.2.1    active      1  grafana-k8s       stable        82  10.152.183.69   no       
loki          2.7.4    active      1  loki-k8s          stable        91  10.152.183.147  no       
prometheus    2.46.0   active      1  prometheus-k8s    latest/edge  154  10.152.183.49   no       
traefik       2.9.6    active      1  traefik-k8s       stable       129  192.168.151.81  no       

Unit             Workload  Agent  Address       Ports  Message
alertmanager/0*  active    idle   10.1.237.168         
catalogue/0*     active    idle   10.1.237.165         
grafana/0*       active    idle   10.1.237.171         
loki/0*          active    idle   10.1.237.169         
prometheus/0*    active    idle   10.1.237.174         
traefik/0*       active    idle   10.1.237.172         

Relevant log output

juju_debug-log_cos.log

nobuto-m commented 10 months ago

FWIW, using latest/edge for both prometheus-k8s and traefik-k8s makes it work, but I haven't looked into why.

[overlay]

bundle: kubernetes
applications:
  prometheus:
    channel: latest/edge
  traefik:
    channel: latest/edge

lucabello commented 9 months ago

Closing it since it appears to be solved.

nobuto-m commented 9 months ago

This is not solved at all. I've seen an instance of it in a customer environment where they tried to follow the latest/stable track by refreshing the charms. As in the example above, a charm on stable and another charm on edge couldn't be mixed because of a backward incompatibility in the traefik relation.
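
The channel mix from the original report is visible in the `Located charm ... channel ...` lines that `juju deploy` printed. As a rough sketch, those lines can be scanned for more than one channel. `channel_summary` and `mixed_channels` are hypothetical helper names, and the comparison is purely textual, so `stable` and `latest/stable` would be counted as different strings.

```shell
# channel_summary: read `juju deploy` output on stdin and print
# "<channel> <charm>" for every 'Located charm' line.
channel_summary() {
    sed -n 's/^Located charm "\([^"]*\)" in charm-hub, channel \(.*\)$/\2 \1/p'
}

# mixed_channels: succeed (exit 0) if more than one distinct channel
# string appears in the deploy output read from stdin.
mixed_channels() {
    n=$(channel_summary | cut -d' ' -f1 | sort -u | wc -l)
    [ "$((n))" -gt 1 ]
}

# Usage, assuming the deploy output was saved to a file named deploy.log:
#   channel_summary < deploy.log
#   mixed_channels < deploy.log && echo "charms come from different channels"
```

This only covers a fresh deploy. It does not detect the transient mix that arises while charms are refreshed one at a time.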

A hook execution error and "/cos-prometheus-0: 404 Not Found" occurred during the upgrade, since at one point multiple "generations" of charms coexisted in the model. Because COS follows a rolling release model, there was no announcement of incoming backward incompatibilities and no upgrade notes, and the model was stuck in the error state.

In the end, multiple iterations of ignoring the error with juju resolved --no-retry and bouncing relations recovered the model state, but the UX of following the latest/stable track of COS wasn't great for the customer.
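
One iteration of that recovery might look roughly like the sketch below. It assumes Juju 3.x (where `juju integrate` adds a relation) and reads "bouncing" as removing and re-adding a relation. The comment does not say which relations were bounced, so the endpoints in the usage line are an assumption, as are the helper names `run` and `bounce_relation`. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
# run: print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# bounce_relation UNIT ENDPOINT_A ENDPOINT_B
# Mark the failed hook on UNIT as resolved without re-running it, then
# remove and re-add the relation between the two endpoints.
bounce_relation() {
    run juju resolved --no-retry "$1"
    run juju remove-relation "$2" "$3"
    # In a real run, wait until the relation is fully gone from
    # `juju status --relations` before re-adding it.
    run juju integrate "$2" "$3"
}

# Usage (endpoint names are assumed, not taken from the report):
#   bounce_relation prometheus/0 prometheus:ingress traefik:ingress-per-unit
```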