baruwaproject / baruwa2

Baruwa 2.0
http://www.baruwa.org
GNU General Public License v3.0

Bug: Mailscanner does not recover after db connection error #125

Closed rmoesbergen closed 7 years ago

rmoesbergen commented 7 years ago

In 2.1.5 there was an update to the way the BSQL process check / recovery works, through a new cronjob that runs /usr/bin/baruwa-check-bs.sh.

When only BSQL is killed, this seems to work. However, we had a connectivity issue to the database server, and mailscanner did not recover from it. The logging was:

May 4 01:17:41 lnx2682vm MailScanner[30452]: BaruwaSQL: DB init Failed: TIMEOUT
May 4 01:17:51 lnx2682vm MailScanner[30452]: BaruwaSQL: Search DB connection Failed: TIMEOUTs
May 4 01:17:51 lnx2682vm MailScanner[30452]: BaruwaSQL: 1d63Gb-0008HT-IS: => root@mx01-azg.solvinity.com Logged to Backup
May 4 01:18:00 lnx2682vm MailScanner[32217]: I have found clamd sophos scanners installed, and will use them all by default.
May 4 01:18:00 lnx2682vm MailScanner[32217]: Using locktype = posix
May 4 01:20:01 lnx2682vm MailScanner[32086]: Locked message: 1d63Yv-0008P5-HT
May 4 01:20:01 lnx2682vm MailScanner[32086]: New Batch: Scanning 1 messages, 1201 bytes
May 4 01:20:01 lnx2682vm MailScanner[32086]: Unscanned: Delivered 1 messages
May 4 01:20:01 lnx2682vm MailScanner[32086]: BaruwaSQL: 1d63Yv-0008P5-HT sent to backup
May 4 01:20:03 lnx2682vm MailScanner[32091]: MailScanner child caught a SIGHUP
May 4 01:20:03 lnx2682vm MailScanner[32091]: Config: calling custom end function BaruwaLog
May 4 01:20:03 lnx2682vm MailScanner[32091]: BaruwaSQL: shutdown requested by child: 32091
May 4 01:20:03 lnx2682vm MailScanner[32217]: MailScanner child caught a SIGHUP
May 4 01:20:03 lnx2682vm MailScanner[32217]: Config: calling custom end function BaruwaLog
May 4 01:20:03 lnx2682vm MailScanner[32217]: BaruwaSQL: shutdown requested by child: 32217
May 4 01:20:03 lnx2682vm MailScanner[32098]: MailScanner child caught a SIGHUP
May 4 01:20:03 lnx2682vm MailScanner[32098]: Config: calling custom end function BaruwaLog
May 4 01:20:03 lnx2682vm MailScanner[32098]: BaruwaSQL: shutdown requested by child: 32098

After this, mailscanner was stopped and no mail was processed, even after the connection with the DB servers was restored.

How to reproduce:

  1. killall MailScanner
  2. run baruwa-check-bs.sh

Expected behavior: mailscanner is started again.

Actual behavior: bash -x output:

Mailscanner is not started because the exit status of the service status check is 3, not 0. In 2.1.4, when mailscanner died, I recall that it would be restarted unconditionally. Can this be fixed?
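To confirm which exit status the recovery script is reacting to, the status command can be run on its own and its exit code printed. The following is a minimal sketch: `report_status` is a hypothetical helper name, not part of Baruwa, and the command to run is passed in as arguments.

```shell
#!/bin/sh
# Hypothetical helper (not part of Baruwa): run the given command,
# discard its output, and print only its exit status.
report_status() {
    "$@" >/dev/null 2>&1
    echo "$?"
}
```

Running `report_status /sbin/service mailscanner status` after step 1 should, going by the report above, print 3 rather than 0.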

rmoesbergen commented 7 years ago

After some further investigation, here's what I think happened:

  1. Connectivity to the database was lost, which caused BSQL to exit
  2. The cron recovery script was run, detected that BSQL was not running, and attempted a restart. However, it calls /sbin/service mailscanner stopms, which kills all mailscanner processes and creates a file called /var/lock/subsys/MailScanner.off, so now mailscanner is completely dead.
  3. The startup of mailscanner failed, because there was still no connection to the database.
  4. Since now the file /var/lock/subsys/MailScanner.off exists, any subsequent runs of the recovery script will just exit.

End result: mailscanner is not running and not started anymore.
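The state this sequence leaves behind (no MailScanner processes, flag file present) can be checked for directly. The following is a minimal sketch, not part of Baruwa: `ms_stuck_state` and its output labels are hypothetical, and the flag file path and process count are passed in as arguments so the function itself has no side effects.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of Baruwa).
#   $1 - path to the "off" flag file, e.g. /var/lock/subsys/MailScanner.off
#   $2 - number of running MailScanner processes
ms_stuck_state() {
    flag_file="$1"
    proc_count="$2"
    if [ "${proc_count}" -gt 0 ]; then
        echo "running"
    elif [ -e "${flag_file}" ]; then
        # No processes and the flag file exists: the state suspected
        # above of blocking later recovery runs.
        echo "stopped-flagged"
    else
        echo "stopped"
    fi
}
```

On an affected host this could be called as `ms_stuck_state /var/lock/subsys/MailScanner.off "$(pgrep MailScanner | wc -l)"`. If the analysis above is right, it would print `stopped-flagged`.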

akissa commented 7 years ago

Could you please explain why connectivity was lost, so that we can understand how to script this? Was it a network-level connectivity loss or an application (db) level one?

rmoesbergen commented 7 years ago

It was a network-level event (a router disappeared), so the connection to the database failed with 'connection timed out'.

akissa commented 7 years ago

P.S. mailscanner will still start up even when the db is down.

rmoesbergen commented 7 years ago

Well, the thing is, it didn't... After the router was back, mailscanner was still down. I had to manually start it (which was hours later).

akissa commented 7 years ago

Okay, your reproduction above does not match the actual conditions in which this error is triggered, but I know what the issue is and am working on a fix for it.

akissa commented 7 years ago

I cannot replicate this at all, having tried many scenarios. However, I have added a check on the return value to start mailscanner if it is not running.

--- a/bin/baruwa-check-bs.sh
+++ b/bin/baruwa-check-bs.sh
@@ -25,13 +25,14 @@ export PATH
 [ -x /usr/lib64/nagios/plugins/check_tcp ] || exit 1

 /sbin/service mailscanner status
+retval="$?"

-if [ "$?" = "0" ]; then
+if [ "${retval}" = "0" ]; then
     /usr/lib64/nagios/plugins/check_tcp -H /var/lib/baruwa/mailscanner/baruwa.sock >/dev/null || {
         >&2 echo "BSQL service not running, restarting mailscanner"
         /sbin/service mailscanner stopms
         for pid in $(/usr/bin/pgrep MailScanner); do
-            /bin/kill -9 "${pid}"
+            /bin/kill -9 "${pid}" 2>/dev/null
         done
         /bin/rm -f /var/lib/baruwa/mailscanner/baruwa-bs.pid
         /bin/rm -f /var/lib/baruwa/mailscanner/baruwa.sock
@@ -40,4 +41,9 @@ if [ "$?" = "0" ]; then
     }
 fi

+if [ "${retval}" = "3" ]; then
+    >&2 echo "MailScanner service not running, starting mailscanner"
+    /sbin/service mailscanner startms
+fi
+
 exit 0
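The control flow of the patched script can be summarised as a mapping from the exit status of `/sbin/service mailscanner status` to an action. The following is a minimal sketch: the function name `action_for_status` and the action labels are hypothetical. Only the status values 0 and 3 and the commands named in the comments come from the patch. By LSB init script convention, a status of 3 means the program is not running.

```shell
#!/bin/sh
# Hypothetical summary of the patched decision logic; the function name
# and action labels are illustrative, not part of baruwa-check-bs.sh.
action_for_status() {
    case "$1" in
        0) echo "check-bsql" ;;  # service running: probe baruwa.sock with check_tcp
        3) echo "start" ;;       # service not running: service mailscanner startms
        *) echo "none" ;;        # any other status: fall through to exit 0
    esac
}
```

Any status other than 0 or 3 still results in no action, as in the patch itself.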

rmoesbergen commented 7 years ago

Could you push an rpm with this fix? We've already had some downtime because of this.

akissa commented 7 years ago

Please apply the patch above; we only release RPMs after an extensive QA process.

akissa commented 7 years ago

RPM updates pushed.