cherti / mailexporter

Export Prometheus-style metrics about mail server functionality
https://prometheus.io
GNU General Public License v3.0

mail_deliver_success not changing to 0 on failure #34

Closed ScrumpyJack closed 2 years ago

ScrumpyJack commented 2 years ago

Here you can see mail_sent_fails_total goes up with each failure but mail_deliver_success stays at 1.

The cause was that the mail server was down, so connections to port 25 were being refused.

Is mail_deliver_success supposed to be 0 in this case?

[Screenshots: 2021-12-13 at 21:14:51 and 2021-12-13 at 21:15:09]

cherti commented 2 years ago

Currently, this is the intended behavior: mail_deliver_success is not changed if sending fails. The reasoning is that if sending has already failed, delivery has arguably not been tested in the first place, so nothing can be said about it and there is no reason to set the metric. Implementation-wise, the 1 you see there is the 1 from 6:00 that hasn't been updated since; mail_last_deliver_time should not have increased from 6:00 onwards either.

This is intended to distinguish between delivery errors and sending errors, because the SMTP servers used for probing don't necessarily have to be the same as the server being monitored, and their downtime should not trigger delivery alerts if there are no delivery problems. Arguably, depending on the setup, delivery problems are more urgent than sending problems, because delivery problems can cause the server to miss mails, whereas sending problems, while annoying, are typically obvious to the user. Hence, one might use mail_deliver_success in a higher escalation stage than other metrics. To ensure this, sending failures do not tamper with mail_deliver_success. This could, however, be made clearer in the README, so thanks for pointing this out!

To alert on something like this, you could build an alerting expression like time() - mail_last_deliver_time > 3600, which should trigger if no delivery probe has made it through in the last hour. (PromQL has no now() function; time() returns the current Unix timestamp in seconds.)
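
As a rough sketch, the staleness check could sit next to a plain delivery-failure check in a Prometheus rule file like the one below. The group name, alert names, severity labels, and the `for` duration are illustrative assumptions, not something mailexporter ships. The sketch also assumes that mail_last_deliver_time is a Unix timestamp in seconds (as the 3600 threshold implies) and that mail_deliver_success is set to 0 when a probe is sent but not delivered. It uses PromQL's time() function for the current timestamp.

```yaml
# Sketch of a Prometheus alerting rule file. Alert names, labels and the
# `for` duration are hypothetical and should be adapted to your setup.
groups:
  - name: mailexporter
    rules:
      # Assumes the gauge drops to 0 when a sent probe fails to arrive.
      - alert: MailDeliveryFailed
        expr: mail_deliver_success == 0
        for: 5m
        labels:
          severity: critical
      # Fires when no delivery probe has completed within the last hour.
      # This also covers the case where sending fails and the
      # mail_deliver_success gauge is therefore never updated.
      - alert: MailDeliveryStale
        expr: time() - mail_last_deliver_time > 3600
        labels:
          severity: warning
```

Keeping the two rules separate preserves the distinction described above: the first reports a tested and failed delivery, while the second only reports that no delivery result has arrived recently, whatever the cause.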