This is related to the following MariaDB Jira issue:
https://jira.mariadb.org/browse/MDEV-15442
It looks like there is a problem with the xtrabackup-v2 SST script in which the donor node does not always detect that the joiner has died, so it can sometimes stream the backup to nowhere.
It looks like the donor executed the following command in this scenario:
Maybe the "keepalive", "connect-timeout=", and/or
"linger=" options from socat's socket option group would
be helpful here?
http://www.dest-unreach.org/socat/doc/socat.html#GROUP_SOCKET
Or maybe the "keepcnt=" and/or "abort-threshold="
options from socat's TCP option group?
http://www.dest-unreach.org/socat/doc/socat.html#GROUP_TCP
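To make the suggestion concrete, here is a rough sketch of how some of those options could be appended to the donor's socat TCP address. The helper name socat_tcp_addr, the example address and port, and the timeout and count values are all made up for illustration; none of this comes from the SST script, and I have not tested whether these options actually make the donor notice a dead joiner.

```sh
# Hypothetical helper (not part of wsrep_sst_xtrabackup-v2.sh): build a
# socat TCP address that carries a few of the options named above.
# The values 10 and 3 are arbitrary placeholders.
socat_tcp_addr() {
    # $1: joiner host, $2: joiner port
    printf 'TCP:%s:%s,connect-timeout=10,keepalive,keepcnt=3\n' "$1" "$2"
}

# Print the address for an example joiner (192.0.2.10 is a documentation address).
socat_tcp_addr 192.0.2.10 4444

# The donor could then stream to it with something like:
#   socat -u stdio "$(socat_tcp_addr 192.0.2.10 4444)"
```

As far as I can tell, keepcnt= only sets the number of unanswered probes, so the idle time and probe interval would still come from the kernel defaults unless they are set as well.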
It also looks like the donor is determining if socat failed by checking to see if its return value was 1:
https://github.com/MariaDB/server/blob/a15ab358fc1ea75634de266fa8150b3e89ac5593/scripts/wsrep_sst_xtrabackup-v2.sh#L975
Is this a good way to determine failure? The socat manual doesn't seem to indicate that this is some special value that indicates a failure. It seems to say that any positive or negative integer could mean a failure:
"On exit, socat gives status 0 if it terminated due to EOF or inactivity timeout, with a positive value on error, and with a negative value on fatal error."
http://www.dest-unreach.org/socat/doc/socat.html#DIAGNOSTICS
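If the manual is right that any non-zero status can mean an error, a check along these lines might be safer than comparing against 1. This is only a sketch: socat_failed is a hypothetical helper, not something from the script, and a subshell stands in for socat so the snippet runs without a joiner.

```sh
# Hypothetical helper (not from the SST script): treat any non-zero
# exit status as a failure, instead of comparing against 1 only.
socat_failed() {
    # $1: exit status captured right after socat returned
    [ "$1" -ne 0 ]
}

# Stand-in for socat: a subshell that exits with an arbitrary non-zero status.
rc=0
( exit 2 ) || rc=$?

if socat_failed "$rc"; then
    echo "socat exited with status $rc, treating the transfer as failed"
fi
```

Since the shell only sees exit statuses in the range 0 to 255, a "negative" value from socat would presumably show up as a large positive number, which a non-zero check also covers.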