I ran into an issue where grafana-agent was unable to reach Prometheus, and I didn't get any clear feedback about it: the grafana-agent charm was in active state.
I was integrating the observability stack with a machine charm using grafana-agent. Here is my setup:
Added the integration in the charm code
Deployed the charm on a machine juju model
Deployed grafana-agent on the same model
Deployed cos-lite on a k8s model
Integrated the applications as required
However, the metrics of my charm were not visible in Prometheus. After some debugging I noticed the following message in the grafana-agent logs:
ts=2024-04-23T08:09:36.894207378Z caller=dedupe.go:112 agent=prometheus instance=772477519ad7c8b1cfe32e99d44a4389 component=remote level=warn remote_name=772477-1d1fdd url=http://prometheus-0.prometheus-endpoints.cos.svc.cluster.local:9090/api/v1/write msg="Failed to send batch, retrying" err="Post \"http://prometheus-0.prometheus-endpoints.cos.svc.cluster.local:9090/api/v1/write\": dial tcp: lookup prometheus-0.prometheus-endpoints.cos.svc.cluster.local on 127.0.0.53:53: server misbehaving"
And of course juju show-unit grafana-agent/0 reported the same URL.
It is failing because grafana-agent, which runs outside the k8s cluster, is trying to reach Prometheus using its cluster-internal FQDN, which cannot be resolved from the machine model.
Later I realised that Traefik was misconfigured during the deployment of cos-lite. Once I fixed that, everything seemed to be working fine.
My enhancement proposal:
Is there any chance grafana-agent could report to the user that it is unable to reach Prometheus, for example through the charm status instead of staying in active state?
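
To illustrate the kind of check this proposal has in mind, here is a minimal Python sketch that tests whether the host in a remote-write URL can be resolved and connected to. It is not taken from the grafana-agent charm; the function name `check_remote_write` and its parameters are hypothetical, and it only uses the Python standard library.

```python
import socket
from urllib.parse import urlsplit


def check_remote_write(url, timeout=3.0, connect=socket.create_connection):
    """Return (ok, message) for the host and port in a remote-write URL.

    Hypothetical helper: `connect` is injectable so the check can be
    exercised without real network access.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        return False, f"no host in URL {url!r}"
    # Fall back to the scheme's default port when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        # Opening a TCP connection covers both DNS resolution and reachability.
        with connect((host, port), timeout=timeout):
            return True, f"{host}:{port} is reachable"
    except socket.gaierror as exc:
        # Name resolution failed, as in the log message above.
        return False, f"cannot resolve {host}: {exc}"
    except OSError as exc:
        # The name resolved, but the connection was refused or timed out.
        return False, f"cannot connect to {host}:{port}: {exc}"


if __name__ == "__main__":
    ok, message = check_remote_write(
        "http://prometheus-0.prometheus-endpoints.cos.svc.cluster.local:9090/api/v1/write"
    )
    print(ok, message)
```

A charm could, as an assumption about one possible design, run a check like this periodically and surface the returned message in its status when `ok` is false, so the failure is visible in juju status rather than only in the agent logs.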