canonical / grafana-agent-k8s-operator

This charmed operator automates the operational procedures of running Grafana Agent, an open-source telemetry collector.
https://charmhub.io/grafana-agent-k8s
Apache License 2.0

Logs flooded with "dropped sample" errors #296

Closed: slapcat closed this issue 2 months ago

slapcat commented 4 months ago

Bug Description

System logs are flooded with the following error leading to high disk utilization:

caller=dedupe.go:112 agent=prometheus level=info url=http://10.242.34.71/cos-prometheus-0/api/v1/write msg="Dropped sample for series that was not explicitly dropped via relabelling"

There should be a way to trim these down or filter them.
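
As a local stopgap (not a fix for the underlying issue), these lines can be discarded at the syslog layer on the affected machines. A minimal sketch, assuming rsyslog is the daemon writing /var/log/syslog and the message text stays stable across agent versions (the config filename is arbitrary):

# Stopgap sketch: drop the noisy dedupe lines before they reach /var/log/syslog.
cat <<'EOF' | sudo tee /etc/rsyslog.d/30-grafana-agent-dropped-sample.conf
:msg, contains, "Dropped sample for series that was not explicitly dropped via relabelling" stop
EOF
sudo systemctl restart rsyslog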

To Reproduce

  1. Deploy grafana-agent charm.
  2. Relate to compatible machine charm.
  3. Watch /var/log/syslog for the error (see the command sketch below).
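
As a rough sketch, those steps as Juju commands (the ubuntu charm and the juju-info endpoint are assumptions for illustration, not details from the report; use juju relate on Juju 2.9):

juju deploy ubuntu
juju deploy grafana-agent
juju integrate grafana-agent:juju-info ubuntu
# Watch syslog on the principal's machine for the dedupe message.
juju ssh ubuntu/0 -- tail -f /var/log/syslog | grep "Dropped sample"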

Environment

Grafana-agent is related to machine charms in an OpenStack cloud.

Relevant log output

caller=dedupe.go:112 agent=prometheus level=info url=http://10.242.34.71/cos-prometheus-0/api/v1/write msg="Dropped sample for series that was not explicitly dropped via relabelling"

Additional context

No response

PietroPasotti commented 3 months ago

Edit: find these logs in the grafana-agent logs.

IbraAoad commented 3 months ago

Might be related https://github.com/prometheus/prometheus/issues/11589

ca-scribner commented 3 months ago

The linked issue from @IbraAoad suggests the problem is backpressure (good summary comment here). I tried to reproduce this but was unable to get "Sample dropped" errors.

First, I tried just using a simple machine charm (ubuntu), but that didn't cause any issues for the grafana-agent machine charm.

Second, I tried grafana-agent-k8s, using flog and avalanche to send large amounts of fake logs/metrics. I artificially created backpressure by continually killing prometheus and loki (while true; do kubectl delete pod prometheus-0; kubectl delete pod loki-0; sleep 1; done), but still couldn't reproduce the "Sample dropped" errors.
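
Spelled out, that kill loop looks like the following (the pod names are the ones from the comment and are assumed to sit in the current kubectl namespace; add -n <namespace> otherwise):

# Continuously kill prometheus and loki to create backpressure on the agent.
while true; do
  kubectl delete pod prometheus-0
  kubectl delete pod loki-0
  sleep 1
done

Following the agent container's logs (for example with kubectl logs on the grafana-agent pod) while the loop runs shows whether the dedupe message starts appearing.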

Do you have a minimal reproduction environment I could use?

Also, within your environment, it would be useful to have:

Which revision of the charm are you using? A related fix upstream was posted in agent 0.35.2, but that version is only picked up by our agent charm at 0.36.0 (commit), which I don't think either the machine or k8s versions of the charm use at the moment. Instead I see them using:
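
As a sketch, both the charm revision and the running agent version can be checked like this (the agent_build_info metric name is an assumption; port 12345 is borrowed from the curl command sed-i posts below):

# Charm revision as reported by Juju.
juju status grafana-agent --format=yaml | grep charm-rev
# Agent version from the agent's self-metrics endpoint.
juju ssh grafana-agent/0 curl -s localhost:12345/metrics | grep agent_build_info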

sed-i commented 3 months ago

If you could also include the output of the following at normal operation and at time of failure, that could be interesting.

juju ssh grafana-agent/0 curl -s localhost:12345/metrics | grep -E "wal|agent_inflight_requests|log_messages_total|failed" | grep -v "^# "

juju ssh grafana-agent/0 df -h
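
A rough way to collect that over time, so the normal-operation and failure-time outputs can be compared afterwards (purely a sketch; the output path and the 60-second interval are arbitrary):

# Sample the requested metrics and disk usage once a minute with timestamps.
while true; do
  echo "=== $(date -Is) ===" >> /tmp/agent-metrics.log
  juju ssh grafana-agent/0 curl -s localhost:12345/metrics \
    | grep -E "wal|agent_inflight_requests|log_messages_total|failed" \
    | grep -v "^# " >> /tmp/agent-metrics.log
  juju ssh grafana-agent/0 df -h >> /tmp/agent-metrics.log
  sleep 60
done
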
ca-scribner commented 3 months ago

Given that upstream thinks this is fixed in grafana-agent 0.36.0, we can probably fix this by bumping our grafana-agent version. I don't know if there's a reason why we haven't done that in a while - I'll look into it.

Either way, though, I'd really like to be able to reproduce these errors in some sort of test first, so we can see that bumping the version actually fixes things.

ca-scribner commented 3 months ago

I might have been wrong about where this patch landed. The fix commit is tagged >=v0.36.0, but it looks like it was also cherry-picked into v0.35.2 and shows up in the v0.35.2 changelog. So it is probably fixed in v0.35.2.

Either way, the next steps are:

  1. @slapcat: if you can, provide a good reproduction case
  2. observability: update grafana-agent to something more recent (~0.40.x)

I'm working on (2) now.
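
For reference, once a charm revision shipping the newer agent is published, picking it up should be a standard refresh; a sketch with a placeholder channel (not a confirmed track):

juju refresh grafana-agent --channel=latest/edge
juju refresh grafana-agent-k8s --channel=latest/edge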

ca-scribner commented 2 months ago

To address this, we've done the following:

Based on those fixes, we're closing this issue as we think it should be addressed. We were unsuccessful in actually reproducing the issue, however, so please reopen if you face this again!