SuperQ / smokeping_prober

Prometheus style smokeping
Apache License 2.0
554 stars 73 forks source link

New packet receive errors metric is spamming #114

Closed hhromic closed 1 year ago

hhromic commented 1 year ago

Hello, I just tried out the new v0.7.0 release of the smokeping prober and checked the new packet send/receive errors metrics.

The pinger prober seems to be working fine, but in our Linux machines the smokeping_receive_errors_total increases rapidly. When investigating this I enabled debug level logging, which revealed a flood of Error receiving packet errors wrapping i/o timeout errors (see a copy of the logs at the end of this issue).

These errors appear continously in a rapid stream, no matter what pinging destination I use.

I went to check the upstream pro-bing library, specifically the error handling code here: https://github.com/prometheus-community/pro-bing/blob/0999adf48b00a72b9ce152a562e2585ac986a6f6/ping.go#L652-L671

And it looks like timeout errors are normally expected and retried using backoff. Therefore I see two possibilities:

  1. The library consumer (e.g. smokeping exporter) is expected to take care of filtering out non-interesting errors.
  2. The library itself should not call OnRecvError() if the error is a controlled/expected error to be retried with backoff.

I'm not sure which one of these would be the correct thing to do because they both have potential use-cases. I do tend to think that most of the consumer applications don't want to manually filter out and thus benefit from (2).

For (2), the simple solution is to move calling OnRecvError() just before the return err call, so they both are consistent.

The same reasoning applies for the OnSendError() handler here: https://github.com/prometheus-community/pro-bing/blob/0999adf48b00a72b9ce152a562e2585ac986a6f6/ping.go#L809-L826

What do you think? If you believe (2) is the right thing to do, I can migrate this issue to the pro-bing repository and provide a fix.

example smokeping /metrics scraping for receive errors total ``` $ while curl -s localhost:9374/metrics | grep ^smokeping_receive_errors_total; do sleep 1; done smokeping_receive_errors_total 1269 smokeping_receive_errors_total 1286 smokeping_receive_errors_total 1303 smokeping_receive_errors_total 1318 smokeping_receive_errors_total 1342 smokeping_receive_errors_total 1362 ```
smokeping exporter logs ``` $ sudo ./smokeping_prober localhost --log.level=debug ts=2023-07-12T11:44:23.894Z caller=main.go:116 level=info msg="Starting smokeping_prober" version="(version=0.7.0, branch=HEAD, revision=04602a7bb7e864bc4458842ff6099815090f24d3)" ts=2023-07-12T11:44:23.895Z caller=main.go:117 level=info msg="Build context" build_context="(go=go1.20.5, platform=linux/amd64, user=root@1946cd9bbf6d, date=20230710-15:17:44, tags=netgo static_build)" ts=2023-07-12T11:44:23.895Z caller=main.go:126 level=info msg="ignoring missing config file" filename= ts=2023-07-12T11:44:23.895Z caller=main.go:153 level=info msg="Pinger resolved" host=localhost ip_addr=127.0.0.1 ts=2023-07-12T11:44:23.895Z caller=main.go:200 level=info msg="Waiting 1s between starting pingers" ts=2023-07-12T11:44:23.895Z caller=main.go:203 level=info msg="Starting prober" address=localhost interval=1s size_bytes=56 source= ts=2023-07-12T11:44:24.898Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9374 ts=2023-07-12T11:44:24.898Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9374 ts=2023-07-12T11:44:24.993Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.046Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.145Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.238Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.291Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.293Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.376Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.432Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.521Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.586Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.661Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.680Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.773Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.807Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.837Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.885Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.896Z caller=collector.go:98 level=debug msg="Echo reply" ip_addr=127.0.0.1 bytes_received=64 icmp_seq=2 time=257.196µs ttl=64 ts=2023-07-12T11:44:25.919Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.935Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:25.956Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.040Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.124Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.178Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.242Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.258Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.331Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.414Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.434Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.521Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.522Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.608Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.633Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.663Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.756Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.856Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.887Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.896Z caller=collector.go:98 level=debug msg="Echo reply" ip_addr=127.0.0.1 bytes_received=64 icmp_seq=3 time=484.181µs ttl=64 ts=2023-07-12T11:44:26.906Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:26.931Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.001Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.063Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.155Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.197Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.283Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.380Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.460Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.490Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.544Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.576Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.649Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.692Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.778Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.845Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ts=2023-07-12T11:44:27.864Z caller=collector.go:114 level=debug msg="Error receiving packet" error="read ip 0.0.0.0: raw-read ip4 0.0.0.0: i/o timeout" ```
SuperQ commented 1 year ago

Hmm, good question. Maybe @floatingstatic could comment on this.

floatingstatic commented 1 year ago

Hi I’m not near a computer (on vacation) but reading the summary I think this is due to using a read deadline in pro-bing in a for loop. You can easily add a conditional to the on receive error handler to ignore the specific error type that is producing that spammy error. In fact I’m pretty sure there is an example of this just after the on receive error function handler call in pro-bing. Alternatively we could move the handler call after that but then we miss out on capturing a certain class of errors.Sent from my iPhoneOn Jul 12, 2023, at 9:25 AM, Ben Kochie @.***> wrote: Hmm, good question. Maybe @floatingstatic could comment on this.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

hhromic commented 1 year ago

@floatingstatic thanks for replying despite being on vacation 🥇. What you describe is basically the same analysis I provided in the issue 👍.

The problem/question is, what would be the correct thing to do? (1) consumer handles it or (2) pro-bing handles it.

As you said, for (2) the error handler would not be capturing the timeout class of errors. Some consumers might or might not be interested on that. They might or might not want to do anything about timeout errors that pro-bing is retrying anyway.

So, (1) is more transparent for a library, but more annoying for the consumers (yes, I agree handling it at the consumer side is easy but still is boilerplate for these consumers that are not interested on those errors).

If (1) is decided upon, I think it should be documented in the library that is normal to get these errors due to the read deadline loop approach in the library.

EDIT: Ah, I just saw that @SuperQ decided on (1) for the smokeping exporter. 🚀

SuperQ commented 1 year ago

I'm also thinking about implementing (2) in the pro-bing library. But we can also handle it here for now.

hhromic commented 1 year ago

Tested the newly released v0.7.1 image, no more spamming :)

$ docker run --rm -it -p 9374:9374 quay.io/superq/smokeping-prober:v0.7.1 localhost --log.level=debug
ts=2023-07-12T16:21:30.684Z caller=main.go:116 level=info msg="Starting smokeping_prober" version="(version=0.7.1, branch=HEAD, revision=c0848dc8c58f2a59b75396b178c584096eb54b0c)"
ts=2023-07-12T16:21:30.684Z caller=main.go:117 level=info msg="Build context" build_context="(go=go1.20.5, platform=linux/amd64, user=root@984563b60de5, date=20230712-14:21:56, tags=netgo static_build)"
ts=2023-07-12T16:21:30.684Z caller=main.go:126 level=info msg="ignoring missing config file" filename=
ts=2023-07-12T16:21:30.685Z caller=main.go:153 level=info msg="Pinger resolved" host=localhost ip_addr=127.0.0.1
ts=2023-07-12T16:21:30.685Z caller=main.go:200 level=info msg="Waiting 1s between starting pingers"
ts=2023-07-12T16:21:30.685Z caller=main.go:203 level=info msg="Starting prober" address=localhost interval=1s size_bytes=56 source=
ts=2023-07-12T16:21:31.686Z caller=collector.go:100 level=debug msg="Echo reply" ip_addr=127.0.0.1 bytes_received=64 icmp_seq=1 time=290.4µs ttl=64
ts=2023-07-12T16:21:31.686Z caller=tls_config.go:274 level=info msg="Listening on" address=[::]:9374
ts=2023-07-12T16:21:31.686Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=[::]:9374
ts=2023-07-12T16:21:32.686Z caller=collector.go:100 level=debug msg="Echo reply" ip_addr=127.0.0.1 bytes_received=64 icmp_seq=2 time=413.301µs ttl=64
ts=2023-07-12T16:21:33.686Z caller=collector.go:100 level=debug msg="Echo reply" ip_addr=127.0.0.1 bytes_received=64 icmp_seq=3 time=163.801µs ttl=64
ts=2023-07-12T16:21:34.687Z caller=collector.go:100 level=debug msg="Echo reply" ip_addr=127.0.0.1 bytes_received=64 icmp_seq=4 time=352.101µs ttl=64
ts=2023-07-12T16:21:35.686Z caller=collector.go:100 level=debug msg="Echo reply" ip_addr=127.0.0.1 bytes_received=64 icmp_seq=5 time=103.1µs ttl=64

And also errors metric not constantly increasing:

$ while curl -s localhost:9374/metrics | grep ^smokeping_receive_errors_total; do sleep 1; done
smokeping_receive_errors_total 0
smokeping_receive_errors_total 0
smokeping_receive_errors_total 0
smokeping_receive_errors_total 0
smokeping_receive_errors_total 0

Thanks for the quick fix and release!