codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0
447 stars 177 forks source link

SST/IST request preparation missing on initial sync #627

Open Hecke29 opened 1 year ago

Hecke29 commented 1 year ago

Using docker-image mariadb:10.9 but also mariadb:10.7. The IST Receiver opens up but WSREP: Prepared SST request: rsync|10.1.0.96:4568/rsync_sst or similar never happens. Basically the joiner never requests a transfer from any other node. So consequently the transfer never starts.

It's a docker-setup that uses bridge-networking which is complex I know: but what is the condition to the SST-Request being prepared at all? Why is it not happening?

Joiner-Logs:

mariadb_1   | 2022-10-28  9:03:00 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 4804590)
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: ####### processing CC 4804590, local, ordered
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: Process first view: 67936391-c563-11ec-8631-f70270539624 my uuid: 524090c4-569f-11ed-8a56-ab7e738222ef
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: Server mariadb-1 connected to cluster at position 67936391-c563-11ec-8631-f70270539624:4804590 with ID 524090c4-569f-11ed-8a56-ab7e738222ef
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: Server status change disconnected -> connected
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: ####### My UUID: 524090c4-569f-11ed-8a56-ab7e738222ef
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
mariadb_1   | 2022-10-28  9:03:00 0 [Note] WSREP: Service thread queue flushed.
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: -1
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: State transfer required: 
mariadb_1   |   Group state: 67936391-c563-11ec-8631-f70270539624:4804590
mariadb_1   |   Local state: 00000000-0000-0000-0000-000000000000:-1
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: Server status change connected -> joiner
mariadb_1   | 2022-10-28  9:03:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
mariadb_1   | 2022-10-28  9:03:00 0 [Note] WSREP: Joiner monitor thread started to monitor
mariadb_1   | 2022-10-28  9:03:00 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address 'mariadb-1.mariadb-galera-cluster' --datadir '/var/lib/mysql/' --parent 1 --progress 0'
mariadb_1   | WSREP_SST: [INFO] rsync SST started on joiner (20221028 09:03:00.889)
mariadb_1   | 2022-10-28  9:03:01 2 [Note] WSREP: ####### IST uuid:00000000-0000-0000-0000-000000000000 f: 0, l: 4804590, STRv: 3
mariadb_1   | 2022-10-28  9:03:01 2 [Note] WSREP: IST receiver addr using tcp://10.1.0.96:4568
mariadb_1   | 2022-10-28  9:03:01 2 [Note] WSREP: IST receiver bind using tcp://0.0.0.0:4568
mariadb_1   | 2022-10-28  9:03:01 2 [Note] WSREP: Prepared IST receiver for 0-4804590, listening at: tcp://0.0.0.0:4568
mariadb_1   | 2022-10-28  9:03:03 0 [Note] WSREP: (524090c4-8a56, 'tcp://0.0.0.0:4567') turning message relay requesting off

Output of ps aux:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mysql          1  0.1  0.4 834088 34448 ?        Ssl  09:02   0:00 mariadbd
mysql         65  0.0  0.0   2880   996 ?        S    09:03   0:00 sh -c wsrep_sst_rsync --role 'joiner' --address 'mariadb-1.mariadb-galera-cluster' --datadir '/var/lib/mysql/' --parent 1 --pr
mysql         66  0.1  0.0   5012  3932 ?        S    09:03   0:00 bash /usr//bin/wsrep_sst_rsync --role joiner --address mariadb-1.mariadb-galera-cluster --datadir /var/lib/mysql/ --parent 1 -
mysql        191  0.0  0.0   8920  2700 ?        S    09:03   0:00 rsync --daemon --no-detach --port 4444 --config /var/lib/mysql/rsync_sst.conf
mysql       1239  0.0  0.0   2780  1052 ?        S    09:06   0:00 sleep 1
root        1240  0.0  0.0   7052  1560 pts/0    R+   09:06   0:00 ps aux

I already confirmed the wsrep_sst_rsync script is waiting for the MAGIC_FILE here:

while [ ! -r "$MAGIC_FILE" ] && check_pid "$TRANSFER_PID" && \
          ps -p $MYSQLD_PID >/dev/null 2>&1
do
        sleep 1
done
ayurchen commented 1 year ago

Hi, Just before that line you have

    echo "ready $ADDR:$RSYNC_PORT/$MODULE"

when the parent process (mysqld) reads that, it will send SST request to cluster and a donor will be chosen, that should be reflected in the error log. If there is nothing more in the error log after the lines that you posted, then something unrelated to networking happened. If this node managed to connect to cluster, then networking is fine. What's in the error log on other nodes?