canonical / grafana-agent-operator

This charmed operator automates the operational procedures of running Grafana Agent, an open-source telemetry collector.
https://charmhub.io/grafana-agent
Apache License 2.0

Check if Grafana Agent can reach Prometheus #93

Open saltiyazan opened 2 months ago

saltiyazan commented 2 months ago

Enhancement Proposal

Hi Team,

I ran into an issue where grafana-agent was unable to reach Prometheus, but I didn't get any clear feedback about it: the grafana-agent charm was in active state.

I was integrating the observability stack with a machine charm using grafana-agent. Here is my setup:

However, the metrics of my charm were not visible in Prometheus. After some debugging, I noticed the following message in the grafana-agent logs:

ts=2024-04-23T08:09:36.894207378Z caller=dedupe.go:112 agent=prometheus instance=772477519ad7c8b1cfe32e99d44a4389 component=remote level=warn remote_name=772477-1d1fdd url=http://prometheus-0.prometheus-endpoints.cos.svc.cluster.local:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://prometheus-0.prometheus-endpoints.cos.svc.cluster.local:9090/api/v1/write\": dial tcp: lookup prometheus-0.prometheus-endpoints.cos.svc.cluster.local on 127.0.0.53:53: server misbehaving"

And of course, juju show-unit grafana-agent/0 showed the same URL in the relation data:

  - relation-id: 4
    endpoint: send-remote-write
    cross-model: true
    related-endpoint: receive-remote-write
    application-data: {}
    related-units:
      prometheus/0:
        in-scope: true
        data:
          egress-subnets: 10.152.183.250/32
          ingress-address: 10.152.183.250
          private-address: 10.152.183.250
          remote_write: '{"url": "http://prometheus-0.prometheus-endpoints.cos.svc.cluster.local:9090/api/v1/write"}'

It fails because grafana-agent, running on a machine outside the k8s cluster, is trying to reach Prometheus through its cluster-internal FQDN, which cannot be resolved from there. Later I realised that Traefik was misconfigured during the deployment of cos-lite. Once I fixed that, everything seemed to work fine.

My enhancement proposal: is there any chance grafana-agent could report to the user that it is unable to reach Prometheus?
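
To make the request more concrete, here is a minimal sketch of what such a reachability check might look like. It is only an illustration, not how the charm works today: the function name check_remote_write_reachable and its return shape are hypothetical, and it only tests DNS resolution and a TCP connection to the host and port of the remote_write URL taken from the relation data. It does not send a real remote-write request.

```python
import socket
from urllib.parse import urlsplit


def check_remote_write_reachable(url, timeout=3.0):
    """Hypothetical helper: return (ok, reason) for a remote_write URL.

    Only checks that the host resolves and accepts a TCP connection;
    it does not validate the HTTP endpoint itself.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        return False, f"no host found in URL {url!r}"
    try:
        # Fall back to the scheme's default port when none is given.
        port = parts.port or (443 if parts.scheme == "https" else 80)
    except ValueError as err:
        return False, f"invalid port in URL {url!r}: {err}"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "ok"
    except socket.gaierror as err:
        # DNS failure, e.g. a cluster-internal FQDN seen from outside k8s.
        return False, f"cannot resolve {host}: {err}"
    except OSError as err:
        # Refused connection, timeout, unreachable network, etc.
        return False, f"cannot connect to {host}:{port}: {err}"


if __name__ == "__main__":
    url = "http://prometheus-0.prometheus-endpoints.cos.svc.cluster.local:9090/api/v1/write"
    ok, reason = check_remote_write_reachable(url)
    print("reachable" if ok else f"unreachable: {reason}")
```

If a check like this failed, the charm could presumably surface the reason in its unit status instead of staying in active state, so the problem would show up in juju status rather than only in the agent logs.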