canonical / grafana-agent-k8s-operator

This charmed operator automates the operational procedures of running Grafana Agent, an open-source telemetry collector.
https://charmhub.io/grafana-agent-k8s
Apache License 2.0

Logs flooded with "dropped sample" errors #296

Closed: slapcat closed this issue 2 months ago

slapcat commented 4 months ago

Bug Description

System logs are flooded with the following error leading to high disk utilization:

caller=dedupe.go:112 agent=prometheus level=info url=http://10.242.34.71/cos-prometheus-0/api/v1/write msg="Dropped sample for series that was not explicitly dropped via relabelling"

There should be a way to trim these down or filter them.
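
As a local stopgap (not a fix for the underlying issue), these lines can be discarded at the syslog layer on the affected machines. A minimal sketch, assuming rsyslog is the daemon writing /var/log/syslog and the message text stays stable across agent versions (the config filename is arbitrary):

# Stopgap sketch: drop the noisy dedupe lines before they reach /var/log/syslog.
cat <<'EOF' | sudo tee /etc/rsyslog.d/30-grafana-agent-dropped-sample.conf
:msg, contains, "Dropped sample for series that was not explicitly dropped via relabelling" stop
EOF
sudo systemctl restart rsyslog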

To Reproduce

  1. Deploy grafana-agent charm.
  2. Relate to compatible machine charm.
  3. Watch /var/log/syslog for the error (see the command sketch below).
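
As a rough sketch, those steps as Juju commands (the ubuntu charm and the juju-info endpoint are assumptions for illustration, not details from the report; use juju relate on Juju 2.9):

juju deploy ubuntu
juju deploy grafana-agent
juju integrate grafana-agent:juju-info ubuntu
# Watch syslog on the principal's machine for the dedupe message.
juju ssh ubuntu/0 -- tail -f /var/log/syslog | grep "Dropped sample"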

Environment

Grafana-agent is related to machine charms in an OpenStack cloud.

Relevant log output

caller=dedupe.go:112 agent=prometheus level=info url=http://10.242.34.71/cos-prometheus-0/api/v1/write msg="Dropped sample for series that was not explicitly dropped via relabelling"

Additional context

No response

PietroPasotti commented 3 months ago

Edit: find these logs in the grafana-agent logs.

IbraAoad commented 3 months ago

Might be related https://github.com/prometheus/prometheus/issues/11589

ca-scribner commented 3 months ago

The linked issue from @IbraAoad suggests the problem is backpressure (good summary comment here). I tried to reproduce this but was unable to get "Sample dropped" errors.

First, I tried just using a simple machine charm (ubuntu), but that didn't cause any issues for the grafana-agent machine charm.

Second, I tried grafana-agent-k8s, using flog and avalanche to send large amounts of fake logs/metrics. I artificially created backpressure by continually killing prometheus and loki (while true; do kubectl delete pod prometheus-0; kubectl delete pod loki-0; sleep 1; done), but still couldn't reproduce the "Sample dropped" errors.
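
Spelled out, that kill loop looks like the following (the pod names are the ones from the comment and are assumed to sit in the current kubectl namespace; add -n <namespace> otherwise):

# Continuously kill prometheus and loki to create backpressure on the agent.
while true; do
  kubectl delete pod prometheus-0
  kubectl delete pod loki-0
  sleep 1
done

Following the agent container's logs (for example with kubectl logs on the grafana-agent pod) while the loop runs shows whether the dedupe message starts appearing.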

Do you have a minimal reproduction environment I could use?

Also, within your environment, it would be useful to have:

Which revision of the charm are you using? A related fix upstream was posted in agent 0.35.2, but that version is only picked up by our agent charm at 0.36.0 (commit), which I don't think either the machine or k8s versions of the charm use at the moment. Instead I see them using:
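
As a sketch, both the charm revision and the running agent version can be checked like this (the agent_build_info metric name is an assumption; port 12345 is borrowed from the curl command sed-i posts below):

# Charm revision as reported by Juju.
juju status grafana-agent --format=yaml | grep charm-rev
# Agent version from the agent's self-metrics endpoint.
juju ssh grafana-agent/0 curl -s localhost:12345/metrics | grep agent_build_info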

sed-i commented 3 months ago

If you could also include the output of the following at normal operation and at time of failure, that could be interesting.

juju ssh grafana-agent/0 curl -s localhost:12345/metrics | grep -E "wal|agent_inflight_requests|log_messages_total|failed" | grep -v "^# "

juju ssh grafana-agent/0 df -h
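
A rough way to collect that over time, so the normal-operation and failure-time outputs can be compared afterwards (purely a sketch; the output path and the 60-second interval are arbitrary):

# Sample the requested metrics and disk usage once a minute with timestamps.
while true; do
  echo "=== $(date -Is) ===" >> /tmp/agent-metrics.log
  juju ssh grafana-agent/0 curl -s localhost:12345/metrics \
    | grep -E "wal|agent_inflight_requests|log_messages_total|failed" \
    | grep -v "^# " >> /tmp/agent-metrics.log
  juju ssh grafana-agent/0 df -h >> /tmp/agent-metrics.log
  sleep 60
done
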
ca-scribner commented 3 months ago

Given that upstream thinks this is fixed in grafana-agent 0.36.0, we can probably fix this by bumping our grafana-agent version. I don't know if there's a reason why we haven't done that in a while - I'll look into it.

Either way, though, I'd really like to be able to reproduce these errors in some sort of test first, so we can see that bumping the version actually fixes things.

ca-scribner commented 3 months ago

I might have been wrong about where this patch landed. The fix commit is tagged >=v0.36.0, but it looks like it was also cherry-picked into v0.35.2 and shows up in the v0.35.2 changelog. So it is probably fixed in v0.35.2.

Either way, the next steps are:

  1. @slapcat: if you can, provide a good reproduction case
  2. observability: update grafana-agent to something more recent (~0.40.x)

I'm working on (2) now.
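
For reference, once a charm revision shipping the newer agent is published, picking it up should be a standard refresh; a sketch with a placeholder channel (not a confirmed track):

juju refresh grafana-agent --channel=latest/edge
juju refresh grafana-agent-k8s --channel=latest/edge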

ca-scribner commented 2 months ago

To address this, we've done the following:

Based on those fixes, we're closing this issue as we think it should be addressed. We were unsuccessful in actually reproducing the issue, however, so please reopen if you face this again!