codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

wsrep_sst_common check_port bug #655

Open glbyers opened 6 months ago

glbyers commented 6 months ago

A common tuning for MariaDB running under systemd on Linux is to raise LimitNOFILE above the default. In our case we set it to infinity, which has a different meaning depending on the version of systemd.
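Because "infinity" can resolve to very different numbers, it is worth checking what a process actually ended up with. A minimal sketch, assuming a Linux /proc filesystem; `nofile_limits_of_pid` is a hypothetical helper name, not something shipped with MariaDB or Galera:

```shell
# Hypothetical helper: print the soft and hard "Max open files" limits of a
# running process, as recorded by the kernel in /proc/<pid>/limits.
nofile_limits_of_pid() {
    # The line looks like: "Max open files  <soft>  <hard>  files"
    awk '/^Max open files/ { print $4, $5 }' "/proc/$1/limits"
}

# Limits of the current shell: prints "<soft> <hard>".
nofile_limits_of_pid $$
```

To inspect the server itself, pass its PID instead of `$$`, for example `nofile_limits_of_pid "$(pidof mariadbd)"` (the process name is an assumption and differs between versions). The value systemd has configured for the unit can be read with `systemctl show -p LimitNOFILE mariadb`, again assuming the unit is named mariadb.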

In the check_port function, we have the following:

    if [ $lsof_available -ne 0 ]; then
        lsof -Pnl -i ":$port" 2>/dev/null | \
        grep -q -E "^($utils)[^[:space:]]*[[:space:]]+$pid[[:space:]].*\\(LISTEN\\)" && rc=0   

The problem is that lsof, on startup, closes all file handles except stdin, stdout & stderr, so the time it takes scales with the nofile limit. When the limit is high, this can take longer than some hard-coded timeouts. For example, in the wsrep_sst_mariabackup script, we have this in recv_joiner:

    local ltcmd="$tcmd"
    if [ $tmt -gt 0 ]; then
        if [ -n "$(commandex timeout)" ]; then
            if timeout --help | grep -qw -F -- '-k'; then
                ltcmd="timeout -k $(( tmt+10 )) $tmt $tcmd"
            else
                ltcmd="timeout -s9 $tmt $tcmd"
            fi
        fi
    fi

    if [ $wait -ne 0 ]; then
        wait_for_listen &
    fi

And in wait_for_listen:

    wait_for_listen()
    {
        for i in {1..150}; do
            if check_port "" "$SST_PORT" 'socat|nc'; then
                break
            fi
            sleep 0.2
        done
        echo "ready $ADDR:$SST_PORT/$MODULE/$lsn/$sst_ver"
    }

So the check_port call needs to complete before the timeout configured in recv_joiner expires, in order to signal to the donor that we're ready to receive the backup. That never happens, because lsof is still busy closing file handles when the timeout expires. On RHEL 8 with LimitNOFILE=infinity set in the systemd unit file for MariaDB, everything is peachy, as the limit is really 64k. But the same config migrated to RHEL 9 results in being unable to bootstrap a cluster, and there's very little in the way of logging to indicate why.
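One way to confirm this on an affected host is to time a single lsof call under different soft limits. A rough sketch; `time_under_nofile` is a hypothetical helper, and the resolution is whole seconds because it relies only on `date +%s`:

```shell
# Hypothetical helper: run a command with the soft nofile limit set to $1 and
# print the elapsed wall-clock time in whole seconds.
time_under_nofile() {
    local limit=$1 start end
    shift
    start=$(date +%s)
    # The subshell keeps the changed limit from leaking into the caller.
    # The command's output and exit status are discarded; only time matters.
    ( ulimit -S -n "$limit" && "$@" ) >/dev/null 2>&1 || :
    end=$(date +%s)
    echo $(( end - start ))
}

# Smoke run with a trivial command; prints the elapsed seconds.
time_under_nofile 1024 true
```

Comparing `time_under_nofile 1024 lsof -Pnl -i :4444` against the same call with the limit the service actually runs with should show whether lsof start-up is what eats the timeout. The port 4444 is only an example; substitute the SST port in use. A soft limit can only be raised up to the hard limit of the calling shell, so the high-limit measurement has to be taken from an environment that already has it.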

Would it be reasonable to set some sane limits within the code that calls the scripts associated with wsrep_sst_method, or perhaps to call ulimit -n 4096 or similar within the wsrep_sst_* scripts? It really is a nasty gotcha.
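The second option could look roughly like the sketch below. `with_capped_nofile` is a hypothetical wrapper, not an existing function in wsrep_sst_common, and 4096 is just the figure floated above, not a tested recommendation:

```shell
# Hypothetical wrapper: run a command with the soft nofile limit lowered to at
# most $1. The limit is changed in a subshell, so the caller is unaffected.
with_capped_nofile() {
    local cap=$1
    shift
    (
        cur=$(ulimit -S -n)
        # Only ever lower the limit; never try to raise it.
        if [ "$cur" = "unlimited" ] || [ "$cur" -gt "$cap" ]; then
            ulimit -S -n "$cap" || exit 1
        fi
        "$@"
    )
}

# The child process sees the capped soft limit.
with_capped_nofile 4096 sh -c 'ulimit -S -n'
```

Inside check_port, the lsof call would then become `with_capped_nofile 4096 lsof -Pnl -i ":$port"`, with the rest of the pipeline unchanged. Capping only around lsof, instead of at the top of the script, leaves the limit inherited by the other tools the SST scripts launch as it was.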