QubitProducts / exporter_exporter

A reverse proxy designed for Prometheus exporters
Apache License 2.0

Race condition in error reporting? #31

Closed: candlerb closed this issue 4 years ago

candlerb commented 4 years ago

I'm seeing something strange with the official exporter_exporter 0.3.0 binary: sometimes the error report shows only "context canceled", with no other message.

root@ldex-mon2:~# exporter_exporter -config.file=/etc/prometheus/exporter_exporter.yml -config.dirs=/etc/prometheus/exporter_exporter.d -web.tls.listen-address '127.0.0.1:9998' -web.tls.verify
FATA[0000] context canceled                              source="main.go:270"

But a little later, when I ran the exact same command, I got a different error:

root@ldex-mon2:~# exporter_exporter -config.file=/etc/prometheus/exporter_exporter.yml -config.dirs=/etc/prometheus/exporter_exporter.d -web.tls.listen-address '127.0.0.1:9998' -web.tls.verify
FATA[0000] Could not parse key/cert, open cert.pem: no such file or directory  source="main.go:234"

I'm not sure what's going on, but it seems like some sort of race condition.

It's a bit of a pain because in the following case I know something is failing, but I can't see the actual error message:

root@ldex-mon2:~# exporter_exporter -config.file=/etc/prometheus/exporter_exporter.yml -config.dirs=/etc/prometheus/exporter_exporter.d -web.tls.listen-address ':9998' -web.tls.cert=/path/to/cert.pem -web.tls.key=/path/to/privkey.pem
FATA[0000] context canceled                              source="main.go:270"

candlerb commented 4 years ago

I found the problem.

When I run exporter_exporter without specifying -web.listen-address, it tries by default to start an HTTP server on :9999. But I already had another exporter_exporter running on that port, so the HTTP server fails to bind.

Hence the HTTP server terminates immediately, returning an err value but without printing it. When a member of an errgroup terminates with an error, the group's shared context is cancelled, so all the other members that watch it are cancelled too. (Possibly the TLS goroutine was cancelled before it even started.)

I tried using eg.Wait() but couldn't get it to work; it just hung. But converting the HTTP server to invoke Fatalf() on failure sorts it:

root@ldex-mon2:~# ./exporter_exporter -config.file=/etc/prometheus/exporter_exporter.yml -config.dirs=/etc/prometheus/exporter_exporter.d -web.tls.listen-address ':9998' -web.tls.cert=/path/to/cert.pem -web.tls.key=/path/to/privkey.pem
FATA[0000] Failed starting HTTP server, listen tcp :9999: bind: address already in use  source="main.go:236"

I will submit a PR.