foxcpp / maddy

✉️ Composable all-in-one mail server.
https://maddy.email
GNU General Public License v3.0
4.81k stars 238 forks source link

queue: infinite retries after reducing max_tries and bogus retry interval #678

Open lukastribus opened 5 months ago

lukastribus commented 5 months ago

Describe the bug

Likely to due a == comparison as opposed to >= , when a email in queue has a retry count higher than max_retries, it will be retried infinitely as the condition never matches:

https://github.com/foxcpp/maddy/blob/cee577790b4c6c9533c46492282d520eef420f7e/internal/target/queue/queue.go#L405

A unrelated secondary issue is seen with the retry delay; which does not match the documented formula; in fact the first 4 retry delays are the same exact 14m59.99 (see logs below). I can file a separate issue if needed.

Steps to reproduce

This can happen in the following situation:

Log files

root@srv1:/var/lib/maddy/remote_queue# journalctl -xeu maddy.service | grep "will retry" | grep ef97411f
Feb 09 16:52:40 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"support@example.net":1},"msg_id":"ef97411f","next_try_delay":"14m59.999996196s","rcpts":["support@example.net"]}
Feb 09 17:07:40 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"support@example.net":2},"msg_id":"ef97411f","next_try_delay":"14m59.99999822s","rcpts":["support@example.net"]}
Feb 09 17:22:40 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"support@example.net":3},"msg_id":"ef97411f","next_try_delay":"14m59.999997532s","rcpts":["support@example.net"]}
Feb 09 17:37:41 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"support@example.net":4},"msg_id":"ef97411f","next_try_delay":"14m59.999996182s","rcpts":["support@example.net"]}
Feb 09 17:52:41 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"support@example.net":5},"msg_id":"ef97411f","next_try_delay":"29m59.999995718s","rcpts":["support@example.net"]}
Feb 09 18:22:41 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"support@example.net":6},"msg_id":"ef97411f","next_try_delay":"44m59.999996344s","rcpts":["support@example.net"]}
Feb 09 19:07:41 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"support@example.net":7},"msg_id":"ef97411f","next_try_delay":"44m59.999996027s","rcpts":["support@example.net"]}
Feb 09 19:52:41 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"support@example.net":8},"msg_id":"ef97411f","next_try_delay":"59m59.999996174s","rcpts":["support@example.net"]}
Feb 09 20:52:42 srv1 maddy[2732258]: queue: will retry        {"attempts_count":{"support@example.net":9},"msg_id":"ef97411f","next_try_delay":"1h14m59.999997755s","rcpts":["support@example.net"]}
Feb 09 22:07:42 srv1 maddy[923]: queue: will retry        {"attempts_count":{"support@example.net":10},"msg_id":"ef97411f","next_try_delay":"1h44m59.999995644s","rcpts":["support@example.net"]}
Feb 09 23:52:42 srv1 maddy[923]: queue: will retry        {"attempts_count":{"support@example.net":11},"msg_id":"ef97411f","next_try_delay":"2h14m59.99999673s","rcpts":["support@example.net"]}
Feb 10 02:07:42 srv1 maddy[923]: queue: will retry        {"attempts_count":{"support@example.net":12},"msg_id":"ef97411f","next_try_delay":"2h44m59.99999787s","rcpts":["support@example.net"]}
Feb 10 04:52:42 srv1 maddy[923]: queue: will retry        {"attempts_count":{"support@example.net":13},"msg_id":"ef97411f","next_try_delay":"3h29m59.999997578s","rcpts":["support@example.net"]}
Feb 10 08:22:43 srv1 maddy[923]: queue: will retry        {"attempts_count":{"support@example.net":14},"msg_id":"ef97411f","next_try_delay":"4h29m59.999997642s","rcpts":["support@example.net"]}
Feb 10 12:52:43 srv1 maddy[41407]: queue: will retry        {"attempts_count":{"support@example.net":15},"msg_id":"ef97411f","next_try_delay":"5h29m59.999996433s","rcpts":["support@example.net"]}
Feb 10 18:22:43 srv1 maddy[41407]: queue: will retry        {"attempts_count":{"support@example.net":16},"msg_id":"ef97411f","next_try_delay":"6h59m59.999997628s","rcpts":["support@example.net"]}
Feb 11 01:22:43 srv1 maddy[41407]: queue: will retry        {"attempts_count":{"support@example.net":17},"msg_id":"ef97411f","next_try_delay":"8h44m59.999996512s","rcpts":["support@example.net"]}
Feb 11 10:07:43 srv1 maddy[41407]: queue: will retry        {"attempts_count":{"support@example.net":18},"msg_id":"ef97411f","next_try_delay":"10h59m59.999996882s","rcpts":["support@example.net"]}
root@srv1:/var/lib/maddy/remote_queue#

When maddy was restarted (with different configuration) can be seen by the changing PID.

Configuration file

target.queue remote_queue {
    max_tries 8
    target &outbound_delivery

Environment information

foxcpp commented 5 months ago

Pushed 2da4ece that changes the condition to use >=. Will look into what is wrong with delay calculation...

lukastribus commented 4 months ago

For the record, here's an output without any restarts in between and with the default of 20 retries. The To address ignorame2@messagenet.it is real (and unreachable, so you can reproduce; there is a TLS issue which makes it retry) :

lukas@srv1:~$ sudo journalctl -xeu maddy.service | grep -E "retry.+c5eb84cd|permanent.+c5eb84cd"
Feb 17 23:54:35 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":1},"msg_id":"c5eb84cd","next_try_delay":"14m59.99999903s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 00:09:36 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":2},"msg_id":"c5eb84cd","next_try_delay":"14m59.999997305s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 00:24:36 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":3},"msg_id":"c5eb84cd","next_try_delay":"14m59.999996852s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 00:39:36 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":4},"msg_id":"c5eb84cd","next_try_delay":"14m59.99999707s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 00:54:36 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":5},"msg_id":"c5eb84cd","next_try_delay":"29m59.999997265s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 01:24:37 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":6},"msg_id":"c5eb84cd","next_try_delay":"44m59.999996767s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 02:09:37 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":7},"msg_id":"c5eb84cd","next_try_delay":"44m59.999997176s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 02:54:37 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":8},"msg_id":"c5eb84cd","next_try_delay":"59m59.999996711s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 03:54:37 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":9},"msg_id":"c5eb84cd","next_try_delay":"1h14m59.999997516s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 05:09:38 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":10},"msg_id":"c5eb84cd","next_try_delay":"1h44m59.999997928s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 06:54:38 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":11},"msg_id":"c5eb84cd","next_try_delay":"2h14m59.999998559s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 09:09:38 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":12},"msg_id":"c5eb84cd","next_try_delay":"2h44m59.99995706s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 11:54:38 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":13},"msg_id":"c5eb84cd","next_try_delay":"3h29m59.999994638s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 15:24:38 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":14},"msg_id":"c5eb84cd","next_try_delay":"4h29m59.999997815s","rcpts":["ignorame2@messagenet.it"]}
Feb 18 19:54:39 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":15},"msg_id":"c5eb84cd","next_try_delay":"5h29m59.999998277s","rcpts":["ignorame2@messagenet.it"]}
Feb 19 01:24:39 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":16},"msg_id":"c5eb84cd","next_try_delay":"6h59m59.999996306s","rcpts":["ignorame2@messagenet.it"]}
Feb 19 08:24:39 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":17},"msg_id":"c5eb84cd","next_try_delay":"8h44m59.999997053s","rcpts":["ignorame2@messagenet.it"]}
Feb 19 17:09:39 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":18},"msg_id":"c5eb84cd","next_try_delay":"10h59m59.999998151s","rcpts":["ignorame2@messagenet.it"]}
Feb 20 04:09:40 srv1 maddy[501510]: queue: will retry        {"attempts_count":{"ignorame2@messagenet.it":19},"msg_id":"c5eb84cd","next_try_delay":"13h44m59.999995836s","rcpts":["ignorame2@messagenet.it"]}
Feb 20 17:54:40 srv1 maddy[501510]: queue: not delivered, permanent error        {"msg_id":"c5eb84cd","rcpt":"ignorame2@messagenet.it"}
lukas@srv1:~$

I did notice the the documentations talks about a factor of 1.2 while the code has 1.25, but clearly there is more to this than a simple disparity of the factor.