Closed rmoesbergen closed 7 years ago
After some further investigation, here's what I think happened:
End result: mailscanner is not running and not started anymore.
Could you please tell us why connectivity was lost, so that we can understand how to script this? Was it a network-level connectivity loss or an application (DB) level failure?
It was a network level event (a router disappeared), so a connection to the database failed with 'connection timed out'
P.S. mailscanner will still start up even when the DB is down.
Well, the thing is, it didn't... After the router was back, mailscanner was still down. I had to manually start it (which was hours later).
Okay, your reproduction above does not match the actual conditions in which this error is triggered, but I know what the issue is and am working on a fix for it.
I cannot replicate this at all, despite trying many scenarios. However, I have added a check on the return value of the status command so that mailscanner is started if it is not running.
--- a/bin/baruwa-check-bs.sh
+++ b/bin/baruwa-check-bs.sh
@@ -25,13 +25,14 @@ export PATH
 [ -x /usr/lib64/nagios/plugins/check_tcp ] || exit 1
 /sbin/service mailscanner status
+retval="$?"
-if [ "$?" = "0" ]; then
+if [ "${retval}" = "0" ]; then
     /usr/lib64/nagios/plugins/check_tcp -H /var/lib/baruwa/mailscanner/baruwa.sock >/dev/null || {
         >&2 echo "BSQL service not running, restarting mailscanner"
         /sbin/service mailscanner stopms
         for pid in $(/usr/bin/pgrep MailScanner); do
-            /bin/kill -9 "${pid}"
+            /bin/kill -9 "${pid}" 2>/dev/null
         done
         /bin/rm -f /var/lib/baruwa/mailscanner/baruwa-bs.pid
         /bin/rm -f /var/lib/baruwa/mailscanner/baruwa.sock
@@ -40,4 +41,9 @@ if [ "$?" = "0" ]; then
     }
 fi
+if [ "${retval}" = "3" ]; then
+    >&2 echo "MailScanner service not running, starting mailscanner"
+    /sbin/service mailscanner startms
+fi
+
 exit 0
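Why the patch saves `$?` into `retval` before testing it can be illustrated with a minimal sketch. The function `fake_status` is a hypothetical stand-in for `/sbin/service mailscanner status` that always reports exit status 3, and the two `check_*` functions are likewise made up for illustration; they are not part of baruwa-check-bs.sh.

```shell
#!/bin/sh
# fake_status: hypothetical stand-in for `/sbin/service mailscanner status`,
# always reporting exit status 3 ("not running").
fake_status() {
    return 3
}

# Saves the status once, so both tests see the same value.
check_with_saved_status() {
    fake_status
    retval="$?"    # save immediately; every later command overwrites $?
    if [ "${retval}" = "0" ]; then
        echo "check-socket"
    fi
    if [ "${retval}" = "3" ]; then
        echo "start"
    fi
}

# Tests $? directly; the second test no longer sees fake_status's result.
check_without_saved_status() {
    fake_status
    if [ "$?" = "0" ]; then
        echo "check-socket"
    fi
    # $? now holds the exit status of the `if` above (0, since no branch ran).
    if [ "$?" = "3" ]; then
        echo "start"
    fi
}

check_with_saved_status
check_without_saved_status
```

Only the first function prints `start`. In the second, `$?` has already been replaced by the time the second test runs, so a "not running" branch written that way would never fire.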
Could you push an rpm with this fix? We've already had some downtime because of this.
Please apply the patch above; we only release RPMs after an extensive QA process.
Rpm updates pushed
In 2.1.5 there was an update to the way the BSQL process check / recovery works, through a new cronjob that runs /usr/bin/baruwa-check-bs.sh.
When only BSQL is killed, this seems to work. However, we had a connectivity issue to the database server, and mailscanner did not recover from it. The logging was:
May 4 01:17:41 lnx2682vm MailScanner[30452]: BaruwaSQL: DB init Failed: TIMEOUT
May 4 01:17:51 lnx2682vm MailScanner[30452]: BaruwaSQL: Search DB connection Failed: TIMEOUTs
May 4 01:17:51 lnx2682vm MailScanner[30452]: BaruwaSQL: 1d63Gb-0008HT-IS: => root@mx01-azg.solvinity.com Logged to Backup
May 4 01:18:00 lnx2682vm MailScanner[32217]: I have found clamd sophos scanners installed, and will use them all by default.
May 4 01:18:00 lnx2682vm MailScanner[32217]: Using locktype = posix
May 4 01:20:01 lnx2682vm MailScanner[32086]: Locked message: 1d63Yv-0008P5-HT
May 4 01:20:01 lnx2682vm MailScanner[32086]: New Batch: Scanning 1 messages, 1201 bytes
May 4 01:20:01 lnx2682vm MailScanner[32086]: Unscanned: Delivered 1 messages
May 4 01:20:01 lnx2682vm MailScanner[32086]: BaruwaSQL: 1d63Yv-0008P5-HT sent to backup
May 4 01:20:03 lnx2682vm MailScanner[32091]: MailScanner child caught a SIGHUP
May 4 01:20:03 lnx2682vm MailScanner[32091]: Config: calling custom end function BaruwaLog
May 4 01:20:03 lnx2682vm MailScanner[32091]: BaruwaSQL: shutdown requested by child: 32091
May 4 01:20:03 lnx2682vm MailScanner[32217]: MailScanner child caught a SIGHUP
May 4 01:20:03 lnx2682vm MailScanner[32217]: Config: calling custom end function BaruwaLog
May 4 01:20:03 lnx2682vm MailScanner[32217]: BaruwaSQL: shutdown requested by child: 32217
May 4 01:20:03 lnx2682vm MailScanner[32098]: MailScanner child caught a SIGHUP
May 4 01:20:03 lnx2682vm MailScanner[32098]: Config: calling custom end function BaruwaLog
May 4 01:20:03 lnx2682vm MailScanner[32098]: BaruwaSQL: shutdown requested by child: 32098
After this, mailscanner was stopped and no mail was processed, even after the connection with the DB servers was restored.
How to reproduce:
Expected behavior: mailscanner is started again.
Actual behavior: bash -x output:
Mailscanner is not started because the exit status is 3, not 0. In 2.1.4, when mailscanner died, I recall that it would be restarted unconditionally. Can this be fixed?
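For context on the exit status: under the LSB init-script conventions, a `status` action returns 0 when the service is running and 3 when it is not running. The sketch below assumes the mailscanner init script follows those conventions, which this thread does not confirm beyond the observed status of 3. `describe_status` is a hypothetical helper, not something shipped with Baruwa.

```shell
#!/bin/sh
# describe_status: hypothetical helper mapping an LSB init-script `status`
# exit code to a short description.
describe_status() {
    case "$1" in
        0) echo "running" ;;
        1) echo "dead, pid file exists" ;;
        2) echo "dead, lock file exists" ;;
        3) echo "not running" ;;
        *) echo "unknown" ;;
    esac
}

# The status reported in this issue:
describe_status 3
```

A check script that only acts on status 0 therefore skips the stopped-service case entirely.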