Open Abuelodelanada opened 7 months ago
Ideas: relation-manager charm libraries could expose a healthcheck/update-status
endpoint that consumer charms can query.
Another idea: collect-app-status, so that a charm can collect statuses from itself and from the libraries it uses.
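A rough sketch of the "collect statuses from itself and the libraries it uses" idea, in plain Python rather than against the real charm framework. Everything here is hypothetical: `collect_status`, the `(level, message)` tuple shape, and the priority order are assumptions made for illustration only.

```python
from typing import Callable, Iterable, Tuple

Status = Tuple[str, str]  # (level, message)

# Assumed priority order for this sketch: the higher number wins.
_PRIORITY = {"active": 0, "waiting": 1, "maintenance": 2, "blocked": 3}


def collect_status(checks: Iterable[Callable[[], Status]]) -> Status:
    """Run every health check and return the most severe status reported.

    Each check is a zero-argument callable, e.g. one provided by the charm
    itself and one by each relation library it uses.
    """
    worst: Status = ("active", "")
    for check in checks:
        level, message = check()
        # Strict comparison: on a tie, the first reported status is kept.
        if _PRIORITY[level] > _PRIORITY[worst[0]]:
            worst = (level, message)
    return worst
```

With this shape, a library that knows its remote-write URL is unusable can report `("blocked", "...")` without the charm having to know how that was determined.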
If someone forgets to deploy COS charms with traefik,
graph LR
subgraph lxd
ubuntu --- grafana-agent
end
subgraph microk8s
prometheus
loki
end
grafana-agent --- loki
grafana-agent --- prometheus
then the URLs sent over a CMR would be k8s cluster URLs, such as http://prom-0.prom-endpoints.pebnote.svc.cluster.local:9090/api/v1/write, which won't be routable from outside the cluster.
Feb 27 21:54:44 juju-799803-2 grafana-agent.grafana-agent[8210]: ts=2024-02-27T21:54:44.643267725Z caller=dedupe.go:112 agent=prometheus instance=1bf1b94ab08a361769e96ef841afbe0e component=remote level=warn remote_name=1bf1b9-3b030a url=http://prom-0.prom-endpoints.pebnote.svc.cluster.local:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://prom-0.prom-endpoints.pebnote.svc.cluster.local:9090/api/v1/write\": dial tcp: lookup prom-0.prom-endpoints.pebnote.svc.cluster.local on 127.0.0.53:53: server misbehaving"
It could be handy if the charm blocks when the target URLs are not routable. Some implementation ideas:
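One minimal way to detect the failure shown in the log above is to check whether the hostname of each target URL resolves at all. The sketch below uses only the Python standard library; the function names `unresolvable_urls` and `blocked_message` are hypothetical, and DNS resolution is a necessary but not sufficient condition for the endpoint being reachable.

```python
import socket
from urllib.parse import urlsplit


def unresolvable_urls(urls, resolve=socket.getaddrinfo):
    """Return the subset of `urls` whose hostname cannot be resolved.

    `resolve` defaults to socket.getaddrinfo and can be replaced in tests.
    """
    failed = []
    for url in urls:
        parts = urlsplit(url)
        host = parts.hostname
        if host is None:
            # Not a URL with a network location; treat it as unusable.
            failed.append(url)
            continue
        port = parts.port or (443 if parts.scheme == "https" else 80)
        try:
            resolve(host, port)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(url)
    return failed


def blocked_message(urls, resolve=socket.getaddrinfo):
    """Return a status message if any URL is unresolvable, else None."""
    failed = unresolvable_urls(urls, resolve)
    if not failed:
        return None
    return "unresolvable endpoints: " + ", ".join(failed)
```

A charm could call something like `blocked_message(...)` with the URLs received over the relation and set a blocked status when the result is not `None`.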
Also, @Abuelodelanada suggested that there should be a way, other than the charm's blocked status, to alert the charm operator that grafana-agent is not sending telemetry. As we might not be able to reach Alertmanager, we might want to have another method of alerting. Maybe cos-alerter could be used?
We also might think about self-monitoring here - does grafana-agent send its own metrics in any way to the COS stack at the moment? Seems like yes. If so, maybe we could use a missing metric or a metric that shows the status of data relay on a specific protocol to set up an alert.
cc @dstathis
From backlog refinement: the suggestion is to first handle this in update_status, and then consider using a Pebble notice if that adds advantages.
Enhancement Proposal
If grafana-agent does not trust the CA that issues the certs for the external Prometheus and Loki, no error/warning is raised.
Let's say that we have the following deployment:
It is related to a COS-Lite deployment with TLS enabled.
Grafana agent will generate its config file using https URLs provided by Prometheus and Loki through CMR:
The problem is that neither metrics nor logs are sent to Prometheus and Loki, and no error is raised or logged.
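To surface this kind of silent failure, a charm could attempt a TLS handshake against each https endpoint and report why it fails. The following is a minimal sketch using the Python standard library; `tls_trust_error` is a hypothetical helper, and it assumes the trust store used by the check (the system store, or the `cafile` passed in) is the same one grafana-agent uses.

```python
import socket
import ssl


def tls_trust_error(host, port, cafile=None, timeout=5.0):
    """Return None if a verified TLS handshake succeeds, else a reason string.

    With cafile=None the system trust store is used; pass a CA bundle path
    to check against a specific CA instead.
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The handshake and certificate verification happen here.
            with context.wrap_socket(sock, server_hostname=host):
                return None
    except ssl.SSLCertVerificationError as e:
        # Must be caught before OSError, which it inherits from.
        return f"certificate verification failed: {e.verify_message}"
    except OSError as e:
        return f"connection failed: {e}"
```

A non-`None` result mentioning certificate verification would point at an untrusted CA, which could then be logged or reflected in the unit status.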