codership / mysql-wsrep

wsrep API patch for MySQL server

xtrabackup-v2 SST donor stuck in DONOR/DESYNCED state when joiner is killed #333

Open GeoffMontee opened 6 years ago

GeoffMontee commented 6 years ago

This is related to the following MariaDB Jira issue:

https://jira.mariadb.org/browse/MDEV-15442

There appears to be a problem in the xtrabackup-v2 SST script: the donor node does not always detect that the joiner has died, so it can end up streaming the backup to nowhere and stay in the DONOR/DESYNCED state.

It looks like the donor executed the following command in this scenario:

WSREP_SST: [INFO] Evaluating innobackupex --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio openssl-connect:node000002512.domain.com:4444,cert=/mariadb/conf/mariadbSST.pem,key=/mariadb/conf/mariadbSST.pem,cafile=/mariadb/source/dbautils/templates//etc/ca.pem; RC=( ${PIPESTATUS[@]} ) (20180228 16:55:28.090)

Maybe the "keepalive", "connect-timeout=", and/or "linger=" options from socat's socket option group would be helpful here?

http://www.dest-unreach.org/socat/doc/socat.html#GROUP_SOCKET

Or maybe the "keepcnt=" and/or "abort-threshold=" options from socat's TCP option group?

http://www.dest-unreach.org/socat/doc/socat.html#GROUP_TCP
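To make the suggestion concrete, here is a rough sketch of how some of these options could be appended to the `openssl-connect` address. The helper name `build_socat_target`, the host name, and all timeout values are made up for illustration, and I have not tested whether these options actually fix the hang. It also uses `keepidle=` and `keepintvl=` from the same TCP option group, which only matter if `keepalive` is enabled.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of wsrep_sst_xtrabackup-v2.sh): builds a socat
# target address with a connect timeout and TCP keepalive probing enabled.
# All numeric values below are illustrative assumptions, not recommendations.
build_socat_target() {
    local host="$1"
    local port="$2"
    # connect-timeout / keepalive: socat SOCKET option group
    # keepidle / keepintvl / keepcnt: socat TCP option group
    printf 'openssl-connect:%s:%s,connect-timeout=30,keepalive,keepidle=60,keepintvl=10,keepcnt=5' \
        "$host" "$port"
}

# Example: print the address that would be handed to "socat -u stdio <address>".
# joiner.example.com is a placeholder host.
build_socat_target "joiner.example.com" 4444
echo
```

The cert=, key=, and cafile= options from the logged command would still need to be appended to the same address.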

It also looks like the donor decides whether socat failed by checking whether its exit status was exactly 1:

https://github.com/MariaDB/server/blob/a15ab358fc1ea75634de266fa8150b3e89ac5593/scripts/wsrep_sst_xtrabackup-v2.sh#L975

Is this a good way to determine failure? The socat manual doesn't seem to say that 1 is a special value that indicates failure. It seems to say that any non-zero value, positive or negative, could mean a failure:

"On exit, socat gives status 0 if it terminated due to EOF or inactivity timeout, with a positive value on error, and with a negative value on fatal error."

http://www.dest-unreach.org/socat/doc/socat.html#DIAGNOSTICS
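As an illustration of the alternative, the check could treat any non-zero entry in the saved `PIPESTATUS` array as a failure instead of comparing against 1. This is only a sketch: `check_pipeline_status` is a hypothetical helper, and the `true | (exit 2)` pipeline is a stand-in for the real innobackupex | socat pipeline. Note that the shell reports exit statuses in the range 0-255, so a "negative" socat status would show up as a large positive number, which a non-zero test also catches.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not the current script's logic): succeeds only if every
# exit status passed to it is 0, i.e. every stage of the pipeline succeeded.
check_pipeline_status() {
    local rc
    for rc in "$@"; do
        if [ "$rc" -ne 0 ]; then
            return 1
        fi
    done
    return 0
}

# Stand-in pipeline: the first stage succeeds, the second exits with status 2,
# a value that a "== 1" comparison would not treat as a failure.
true | (exit 2)
RC=( "${PIPESTATUS[@]}" )   # must be captured immediately after the pipeline

if check_pipeline_status "${RC[@]}"; then
    echo "pipeline succeeded"
else
    echo "pipeline failed, exit statuses: ${RC[*]}"
fi
```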