frebib / nzbget-exporter

NZBGet Prometheus metrics exporter

i/o timeout causes go panic and exporter dies #5

Closed: SpangleLabs closed this issue 1 year ago

SpangleLabs commented 1 year ago

Hey there,

This is a really great exporter, thank you!

Lately though, when my server is particularly overloaded and slow, the exporter occasionally gets an i/o timeout when trying to contact the nzbget API, and that kills the exporter.

Some logs here, via docker:

nzbget_exporter_1  | time='2023-06-15T06:38:00Z' level=info msg='GET /metrics' remote='192.168.1.117:53202'
nzbget_exporter_1  | time='2023-06-15T06:39:00Z' level=info msg='GET /metrics' remote='192.168.1.117:53202'
nzbget_exporter_1  | time='2023-06-15T06:40:08Z' level=info msg='GET /metrics' remote='192.168.1.117:51742'
nzbget_exporter_1  | time='2023-06-15T06:41:00Z' level=info msg='GET /metrics' remote='192.168.1.117:51742'
nzbget_exporter_1  | time='2023-06-15T06:42:00Z' level=info msg='GET /metrics' remote='192.168.1.117:59704'
nzbget_exporter_1  | time='2023-06-15T06:42:20Z' level=error msg='api get history' error='Get "http://nzbget:6789/jsonrpc/history": dial tcp: lookup nzbget on 127.0.0.11:53: read udp 127.0.0.1:50044->127.0.0.11:53: i/o timeout'
nzbget_exporter_1  | time='2023-06-15T06:42:20Z' level=error msg='api get version' error='Get "http://nzbget:6789/jsonrpc/version": dial tcp: lookup nzbget on 127.0.0.11:53: read udp 127.0.0.1:50044->127.0.0.11:53: i/o timeout'
nzbget_exporter_1  | time='2023-06-15T06:42:20Z' level=error msg='api get status' error='Get "http://nzbget:6789/jsonrpc/status": dial tcp: lookup nzbget on 127.0.0.11:53: read udp 127.0.0.1:50044->127.0.0.11:53: i/o timeout'
nzbget_exporter_1  | time='2023-06-15T06:42:20Z' level=error msg='api get servervolumes' error='Get "http://nzbget:6789/jsonrpc/servervolumes": dial tcp: lookup nzbget on 127.0.0.11:53: read udp 127.0.0.1:50044->127.0.0.11:53: i/o timeout'
nzbget_exporter_1  | time='2023-06-15T06:42:20Z' level=error msg='api get config' error='Get "http://nzbget:6789/jsonrpc/config": dial tcp: lookup nzbget on 127.0.0.11:53: read udp 127.0.0.1:50044->127.0.0.11:53: i/o timeout'
nzbget_exporter_1  | panic: send on closed channel
nzbget_exporter_1  | 
nzbget_exporter_1  | goroutine 15365 [running]:
nzbget_exporter_1  | main.(*NZBGetCollector).Collect.func1()
nzbget_exporter_1  |    /build/collect.go:285 +0x2c6
nzbget_exporter_1  | created by main.(*NZBGetCollector).Collect
nzbget_exporter_1  |    /build/collect.go:280 +0x178

It's easy enough to restart the pod when this happens, but it would be nicer if the exporter handled it gracefully.

Similarly, it would be nice if it exported a metric indicating whether nzbget is up or not.

frebib commented 1 year ago

Nice find on this bug! That was a tricky case I never considered when I originally wrote this. What is happening here is that all of the goroutines waiting for the various nzbget responses return (like here: https://github.com/frebib/nzbget-exporter/blob/d2d0b05c2604b35ff6f4efbda1d92d52701588ea/collect.go#L311), so the Prometheus client library closes the channel before the config call is able to write its error (here: https://github.com/frebib/nzbget-exporter/blob/d2d0b05c2604b35ff6f4efbda1d92d52701588ea/collect.go#L285), which causes the "send on closed channel" panic. Normally we'd wait for the config call to return (https://github.com/frebib/nzbget-exporter/blob/d2d0b05c2604b35ff6f4efbda1d92d52701588ea/collect.go#L333), but if we error out that never gets hit.
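For anyone following along, here's a minimal sketch of the failure mode. It is not the exporter's actual code: configDesc, fetchConfig, and the metric name are invented for the example. The Prometheus registry closes the channel as soon as Collect returns, so a goroutine that sends on it afterwards panics; waiting for every goroutine on every code path, including error paths, is what keeps the send safe.

```go
package collector

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// configDesc, fetchConfig, and this collector's layout are stand-ins for the
// illustration; they are not the exporter's real identifiers.
var configDesc = prometheus.NewDesc("nzbget_config_example", "illustrative metric", nil, nil)

type NZBGetCollector struct{}

func (c *NZBGetCollector) Describe(ch chan<- *prometheus.Desc) { ch <- configDesc }

// fetchConfig stands in for the slow jsonrpc/config call.
func (c *NZBGetCollector) fetchConfig() (float64, error) { return 1, nil }

// Collect sketches the failure mode: the Prometheus registry closes ch as soon
// as Collect returns, so any goroutine that sends on ch after that point
// panics with "send on closed channel".
func (c *NZBGetCollector) Collect(ch chan<- prometheus.Metric) {
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		val, err := c.fetchConfig() // imagine this blocking on a DNS timeout
		if err != nil {
			// if Collect has already returned, this send is the panic
			ch <- prometheus.NewInvalidMetric(configDesc, err)
			return
		}
		ch <- prometheus.MustNewConstMetric(configDesc, prometheus.GaugeValue, val)
	}()

	// Waiting here before returning is what keeps the sends above safe;
	// returning early on an error elsewhere skips the wait and leaves the
	// goroutine writing to a closed channel.
	wg.Wait()
}
```

The fix in the real collector would be along those lines: make sure the wait (or some equivalent coordination) happens even when one of the API calls errors out.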

It looks like you should consider switching DNS resolvers as yours seems to be timing out :/

frebib commented 1 year ago

In the case of checking whether nzbget is up: the exporter will always return a 500 to Prometheus if any of the scrapes fail, so the up{} metric will have the value 0 in that case. You should be able to use that as a signal for whether NZBGet is working or not.
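For example (assuming your Prometheus scrape config names the exporter's job nzbget; substitute whatever job name you actually use), an alert expression like up{job="nzbget"} == 0 would fire whenever a scrape fails, whether that's because NZBGet's API is unreachable or because the exporter itself is down.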