cherti / mailexporter

Export Prometheus-style metrics about mail server functionality
https://prometheus.io
GNU General Public License v3.0

mail_deliver_success not changing to 0 on failure #38

Closed Ekleog closed 7 months ago

Ekleog commented 8 months ago

This is a re-opening of #34, due to the suggested workaround not actually working. The suggested workaround was to use last_mail_deliver_time.

Unfortunately, that does not seem to actually work: if the sending fails after a reboot, mailexporter does not appear to re-create the time series (because it does not have a last mail deliver time yet).

In my case, a distribution upgrade (NixOS 23.05 to 23.11) broke my sending server, and mailexporter was restarted alongside it as part of the reboot, so the failure never triggered an alert in Prometheus.
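To make the failure mode above concrete, here is a minimal sketch of an alert condition that treats a missing "last deliver time" as a failure instead of silently skipping it. The function name and parameters are hypothetical and are not part of mailexporter; in PromQL terms, this roughly corresponds to combining a staleness check with `absent()`.

```python
from typing import Optional


def delivery_is_stale(
    last_deliver_time: Optional[float],  # hypothetical: None means the series does not exist
    now: float,
    max_age_seconds: float,
) -> bool:
    """Return True when an alert should fire."""
    if last_deliver_time is None:
        # No sample at all, e.g. the exporter restarted and nothing has been
        # delivered since. A plain "now - last > max_age" check has nothing
        # to compare against here, so the absence itself is treated as a failure.
        return True
    return now - last_deliver_time > max_age_seconds
```

With this shape, a restart followed by failing sends still alerts, at the cost of a short false-positive window right after startup until the first mail arrives.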

Thankfully I also run manual tests, so I caught it, but it would have been nice if mailexporter had caught it.

Do you think it'd make sense to add a mail_submission_success metric? I'd even argue that mail_deliver_success should be false if submission fails (because the mail was not actually delivered), and that the onus of ignoring failed submissions should be on consumers who want to do so (by alerting on !mail_deliver_success && mail_submission_success). I can live just as well with updating my alerts, though; it's mostly new users who will find the current behavior unexpected :)

Anyway, thank you for doing mailexporter! :)

cherti commented 7 months ago

A mail_submission_success (or rather mail_submission_failure) metric already exists as rate(mail_send_fails[<timeframe>]). :)

mail_deliver_success stays 1 if a send fails because otherwise you would get alerts for your mail delivery whenever your external probe server has issues, even though your delivery system works fine. This is why send failures are tracked separately in mail_send_fails and mail_deliver_success explicitly doesn't change when a send fails: a failed send does not indicate a failed delivery.
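The separation described above can be sketched as a small model with one counter and one gauge. This is only an illustration of the described behavior, not mailexporter's actual code; the class and method names are hypothetical, and the gauge is assumed to start at 1 from an earlier successful delivery.

```python
from dataclasses import dataclass


@dataclass
class ProbeMetrics:
    send_fails: int = 0       # counter: only ever increases
    deliver_success: int = 1  # gauge: assumed 1 from an earlier delivered probe

    def record_send_failure(self) -> None:
        # The probe mail could not be submitted. Only the counter changes;
        # the delivery gauge is deliberately left untouched.
        self.send_fails += 1

    def record_delivery(self, delivered: bool) -> None:
        # A submitted probe mail either arrived in time or it did not.
        self.deliver_success = 1 if delivered else 0
```

An alerting setup that wants to catch both kinds of failure therefore has to watch both signals, since a send failure alone never moves the gauge.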

Ekleog commented 7 months ago

Good point, I hadn't noticed the mail_send_fails metric! I still think a mail_submission_success metric would make sense: on my server, if I set the timeframe too low, Prometheus always seems to return 0, and a gauge would be easier to handle. Do you want to keep this issue open to track it, or do you think such a metric would not be useful enough?

That said, thank you for pointing out the workaround!

I'm still sad that at least two of us fell into the mail_deliver_success trap until something other than mail-exporter let us know that mails were broken; but that's the way it is :sweat_smile:

Thank you for mail-exporter anyway! :)

cherti commented 7 months ago

You can just set <timeframe> sufficiently larger than the test interval; then the result is always > 0 if a send failed somewhere within that window. It's actually debatable whether mail_deliver_success should have been a counter metric as well, given that a counter would be the rawest data and the processing is done on the Prometheus end, but that's not gonna change anymore at this point.
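The effect of the window size can be illustrated with a small simulation. This is a simplified sketch: `simple_increase` just subtracts the first sample in the window from the last one, whereas Prometheus' real `rate()`/`increase()` also extrapolate and handle counter resets, and return no result at all (rather than 0) when fewer than two samples fall in the window. All names and numbers here are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def scrape_counter(fail_times: List[float], scrape_interval: float, end: float) -> List[Sample]:
    """Simulate scraping a send-failure counter at a fixed interval."""
    samples = []
    t = 0.0
    while t <= end:
        # Counter value = number of failures that happened up to this scrape.
        samples.append((t, float(sum(1 for f in fail_times if f <= t))))
        t += scrape_interval
    return samples


def simple_increase(samples: List[Sample], now: float, window: float) -> float:
    """Last-minus-first counter value within [now - window, now]."""
    in_window = [v for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return 0.0  # simplification: Prometheus would return an empty result
    return in_window[-1] - in_window[0]


# One send failure at t=100s, scraped every 15s for 15 minutes.
samples = scrape_counter([100.0], scrape_interval=15.0, end=900.0)
short = simple_increase(samples, now=900.0, window=60.0)   # failure is outside the window
wide = simple_increase(samples, now=900.0, window=900.0)   # failure is inside the window
```

In this sketch the 60-second window only sees the increment for about a minute after the failure, which is why a short window looks like it "always returns 0" at evaluation time, while the wider window still reports it.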

I'll close this given that the problem is resolved. :)