canonical / grafana-agent-operator

https://charmhub.io/grafana-agent
Apache License 2.0

No error is raised/logged if grafana agent cannot reach Prometheus/Loki #57

Open Abuelodelanada opened 7 months ago

Abuelodelanada commented 7 months ago

Enhancement Proposal

If grafana-agent does not trust the CA that issues the certs for the external Prometheus and Loki, no error/warning is raised.

Let's say that we have the following deployment:

(deployment diagram omitted)

The deployment is related to a COS-Lite deployment with TLS enabled.

Grafana agent will generate its config file using https URLs provided by Prometheus and Loki through CMR:

```
ubuntu@juju-41ca16-0:~$ cat /etc/grafana-agent.yaml  | grep https
    url: https://192.168.1.250/cos-prometheus-0/api/v1/write
      url: https://192.168.1.250/cos-loki-0/loki/api/v1/push
      url: https://192.168.1.250/cos-loki-0/loki/api/v1/push
      url: https://192.168.1.250/cos-prometheus-0/api/v1/write
```

The problem is that neither metrics nor logs are sent to Prometheus and Loki, and no error is raised or logged.
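For context, here is a minimal sketch (not taken from the charm) of the kind of probe the charm could run against the rendered remote_write/push URLs before considering itself healthy. The function name is made up, and a real implementation would also need to honour any CA bundle the charm ships rather than relying only on the system trust store:

```python
import ssl
import urllib.error
import urllib.request
from typing import Optional


def check_endpoint_tls(url: str, timeout: float = 5.0) -> Optional[str]:
    """Return an error description if `url` is unreachable or its certificate
    is not trusted by the system CA store; return None if it looks fine."""
    try:
        urllib.request.urlopen(url, timeout=timeout, context=ssl.create_default_context())
    except urllib.error.HTTPError:
        # The server answered (e.g. 405 on a GET to a remote_write endpoint),
        # so TLS verification and routing both worked.
        return None
    except urllib.error.URLError as e:
        if isinstance(e.reason, ssl.SSLCertVerificationError):
            return f"untrusted certificate for {url}: {e.reason}"
        return f"cannot reach {url}: {e.reason}"
    return None
```

If such a check fails, the charm could at least log the error or surface a status instead of silently writing a config that will never deliver data.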

mmkay commented 5 months ago

Ideas: relation manager charm libraries could have a healthcheck/update-status endpoint that could be queried by consumer charms.

Other idea: collect-app-status, so that the charm can collect statuses from itself and the libraries it uses.
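A rough sketch of the collect-status idea using the collect-status events that recent versions of the ops library provide; `_unreachable_endpoints` is a hypothetical helper standing in for whatever the charm and its libraries would report:

```python
import ops


class GrafanaAgentCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # ops >= 2.1 fires a collect-status event at the end of every hook,
        # letting the charm (and code it delegates to) contribute statuses.
        self.framework.observe(self.on.collect_unit_status, self._on_collect_status)

    def _on_collect_status(self, event: ops.CollectStatusEvent):
        unreachable = self._unreachable_endpoints()  # hypothetical helper
        if unreachable:
            event.add_status(ops.BlockedStatus(f"cannot reach: {', '.join(unreachable)}"))
        else:
            event.add_status(ops.ActiveStatus())
```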

mmkay commented 4 months ago

#64 referred to a very similar problem:

Enhancement Proposal

If someone forgets to deploy COS charms with traefik,

```mermaid
graph LR

subgraph lxd
    ubuntu --- grafana-agent
end

subgraph microk8s
    prometheus
    loki
end

grafana-agent --- loki
grafana-agent --- prometheus
```

then the URLs sent over a CMR would be Kubernetes-cluster-internal URLs, such as http://prom-0.prom-endpoints.pebnote.svc.cluster.local:9090/api/v1/write, which won't be routable from outside the cluster.

```
Feb 27 21:54:44 juju-799803-2 grafana-agent.grafana-agent[8210]: ts=2024-02-27T21:54:44.643267725Z caller=dedupe.go:112 agent=prometheus instance=1bf1b94ab08a361769e96ef841afbe0e component=remote level=warn remote_name=1bf1b9-3b030a url=http://prom-0.prom-endpoints.pebnote.svc.cluster.local:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://prom-0.prom-endpoints.pebnote.svc.cluster.local:9090/api/v1/write\": dial tcp: lookup prom-0.prom-endpoints.pebnote.svc.cluster.local on 127.0.0.53:53: server misbehaving"
```

It could be handy if the charm went into blocked status when the target URLs are not routable. Some implementation ideas:
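For illustration only (not necessarily the ideas the comment had in mind), a routability check can stay within the Python standard library: resolve the host in the URL and attempt a TCP connection to its port:

```python
import socket
from urllib.parse import urlparse


def is_routable(url: str, timeout: float = 5.0) -> bool:
    """Best-effort check that the host in `url` resolves and accepts TCP connections."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures (socket.gaierror, as in the log above) as well
        # as refused or timed-out connections.
        return False
```

The cluster-local URL in the log above would fail the DNS lookup and return False.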

mmkay commented 4 months ago

Also, @Abuelodelanada suggested that there should be a way, other than the charm's blocked status, to alert the charm operator that grafana-agent is not sending telemetry. Since we might not be able to reach Alertmanager, we might want another alerting method; maybe cos-alerter could be used?

We might also think about self-monitoring here: does grafana-agent currently send its own metrics to the COS stack in any way? It seems like yes. If so, maybe we could set up an alert on a missing metric, or on a metric that shows the status of data relay for a specific protocol.
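As a sketch of what leaning on the agent's own remote-write counters could look like, the snippet below sums failure counters from the agent's local metrics endpoint; the address, port, and metric name are assumptions about a default grafana-agent setup, not anything confirmed in this thread:

```python
import urllib.request

# Assumed local agent HTTP endpoint; grafana-agent commonly listens on :12345,
# but the snap/charm may configure something else.
AGENT_METRICS_URL = "http://localhost:12345/metrics"


def remote_write_failed_samples() -> float:
    """Sum the remote-write failure counters the agent exposes about itself."""
    total = 0.0
    with urllib.request.urlopen(AGENT_METRICS_URL, timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            # Metric name borrowed from the upstream Prometheus remote-write
            # client; adjust to whatever the deployed agent actually exposes.
            if line.startswith("prometheus_remote_storage_samples_failed_total"):
                total += float(line.rsplit(" ", 1)[-1])
    return total
```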

cc @dstathis

ca-scribner commented 3 months ago

From backlog refinement: the suggestion is to handle this in update_status first, and then consider using a Pebble notice if that adds advantages.
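A sketch of the update_status route, reusing a connectivity check like the `is_routable` helper above; `_telemetry_urls` is a placeholder for however the charm already tracks the remote_write/push URLs:

```python
import ops


class GrafanaAgentCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.update_status, self._on_update_status)

    def _on_update_status(self, _: ops.UpdateStatusEvent):
        # Re-check the rendered telemetry URLs on every update-status tick.
        bad = [u for u in self._telemetry_urls() if not is_routable(u)]  # placeholder helper
        if bad:
            self.unit.status = ops.BlockedStatus(f"unreachable: {', '.join(bad)}")
        else:
            self.unit.status = ops.ActiveStatus()
```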