socat communication with other nodes is taking forever

mayask commented 7 years ago

Hi

I'm testing the ability of a Galera cluster to recover a data node.

The node recovery script hangs when it starts collecting info from other nodes. Here is the TRACE output:

mysqld.sh: Attempting to recover GTID positon...
++ mktemp -t wsrep_recover.XXXXXX
+ tmpfile=/tmp/wsrep_recover.onIjcq
+ mysqld --wsrep-on=ON --wsrep_sst_method=skip --wsrep_cluster_address=gcomm:// --skip-networking --wsrep-recover
+ '[' 0 -ne 0 ']'
+ grep WSREP /tmp/wsrep_recover.onIjcq
2017-01-23 18:18:30 140129296246720 [Note] WSREP: Recovered position: 9393f3f9-e18d-11e6-916a-8bd9bb399557:195
+ echo 'mysqld.sh: --------------------------------------------------'
mysqld.sh: --------------------------------------------------
++ sed -n 's/.*WSREP: Recovered position:\s*//p' /tmp/wsrep_recover.onIjcq
+ POSITION=9393f3f9-e18d-11e6-916a-8bd9bb399557:195
+ rm -f /tmp/wsrep_recover.onIjcq
++ sed -E 's#.*--wsrep_node_address=([0-9\.]+):4567.*#\1#'
+ NODE_ADDRESS=10.1.58.4
++ sed -E 's#.*gcomm://([0-9\.,]+)\s+.*#\1#'
+ GCOMM='--console --wsrep-on=ON --wsrep_cluster_name=dbcluster --wsrep_cluster_address=gcomm://10.0.0.173,10.0.0.68?pc.wait_prim=no --wsrep_node_address=10.1.58.4:4567 --wsrep-gtid-mode=1 --wsrep-gtid-domain-id=1 --wsrep_sst_auth=xtrabackup:123 --default-time-zone=+00:00'
+ [[ -f /var/lib/mysql/wsrep-new-cluster ]]
+ [[ -z 9393f3f9-e18d-11e6-916a-8bd9bb399557:195 ]]
+ check_nodes --console --wsrep-on=ON --wsrep_cluster_name=dbcluster '--wsrep_cluster_address=gcomm://10.0.0.173,10.0.0.68?pc.wait_prim=no' --wsrep_node_address=10.1.58.4:4567 --wsrep-gtid-mode=1 --wsrep-gtid-domain-id=1 --wsrep_sst_auth=xtrabackup:123 --default-time-zone=+00:00 10.1.58.4
+ for node in '${1//,/ }'
+ '[' --console = --wsrep-on=ON ']'
+ curl -f -s -o - http://--console:8081
+ return 1
+ LISTEN_PORT=3309
+ EXPECT_NODES=3
+ [[ -f /var/lib/mysql/gvwstate.dat ]]
++ awk '/^view_id:/{print $2 " " $3 " " $4}'
+ VIEW_ID='3 d9304629-e196-11e6-8069-5b4e96a363da 4'
++ grep '^member:'
++ wc -l
+ GVW_MEMBERS=2
+ echo 'mysqld.sh: Found view from gvwstate.dat with (2) members: 3 d9304629-e196-11e6-8069-5b4e96a363da 4'
mysqld.sh: Found view from gvwstate.dat with (2) members: 3 d9304629-e196-11e6-8069-5b4e96a363da 4
+ [[ 2 -gt 3 ]]
++ wc -w
+ [[ 10 -gt 3 ]]
++ wc -w
+ EXPECT_NODES=10
+ EXPECT_NODES=9
+ echo 'mysqld.sh: Collecting grastate.dat and gvwstate.dat info from other nodes...'
mysqld.sh: Collecting grastate.dat and gvwstate.dat info from other nodes...
+ set -m
++ mktemp -t socat.XXXX
+ tmpfile=/tmp/socat.ba40
+ PID_SERVER=94
+ SENT_NODES=
+ for i in '{36..0}'
+ for node in '${GCOMM//,/ }'
+ [[ --console = 10.1.58.4 ]]
+ socat - TCP:--console:3309
+ socat -u TCP-LISTEN:3309,bind=10.1.58.4,fork OPEN:/tmp/socat.ba40,append

The exact command in question can be seen here:

root@galera-node-1:/# ps auxw 
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  20124    20 ?        Ss   18:18   0:00 /bin/bash /usr/local/bin/start.sh node galera-seed,galera-node
root        35  0.0  0.0   7480   940 ?        Sl   18:18   0:00 galera-healthcheck -user=system -password=181210f8f9c779c26da1d9b20
root        36  0.0  0.0   7480   552 ?        Sl   18:18   0:00 galera-healthcheck -user=system -password=181210f8f9c779c26da1d9b20
mysql       39  0.0  0.0  20180   308 ?        S    18:18   0:00 /bin/bash /usr/local/bin/mysqld.sh --console --wsrep-on=ON --wsrep_
mysql       94  0.0  0.0  19636   352 ?        S    18:18   0:00 socat -u TCP-LISTEN:3309,bind=10.1.58.4,fork OPEN:/tmp/socat.ba40,a
mysql       95 74.1  0.0  25944   400 ?        R    18:18  29:29 socat - TCP:--console:3309
root       349  0.0  0.1  20252  1108 ?        Ss   18:40   0:00 bash
root       565  0.0  0.1  17492  1144 ?        R+   18:58   0:00 ps auxw

Could you please tell me if it's due to default timeout settings being too big?

If so, what would be an appropriate timeout to set here? I can submit a PR for this.

Thanks a lot

mayask commented 7 years ago

Looks like socat - TCP:--console:3309 is not correct: --console is a command-line flag, not a node address.
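
One plausible reading of the trace: the sed expression that fills GCOMM requires whitespace right after the address list, but here the list is followed by ?pc.wait_prim=no. When the pattern does not match, sed prints its input unchanged, so GCOMM ends up holding the entire argument string and the first "node" becomes --console. The sketch below reproduces that with the values from the trace (argument string shortened, GNU sed assumed for \s). GCOMM_STRICT and its expression are hypothetical and not part of the original script.

```shell
#!/usr/bin/env bash
# Arguments as they appear in the trace (shortened).
ARGS='--console --wsrep-on=ON --wsrep_cluster_name=dbcluster --wsrep_cluster_address=gcomm://10.0.0.173,10.0.0.68?pc.wait_prim=no --wsrep_node_address=10.1.58.4:4567'

# Same expression as in the trace. Without -n and the p flag, sed prints
# the input line unchanged whenever the pattern fails to match.
GCOMM=$(echo "$ARGS" | sed -E 's#.*gcomm://([0-9\.,]+)\s+.*#\1#')

# Hypothetical stricter variant (an assumption, not the script's code):
# accept either "?" or whitespace after the address list, and use -n with
# the p flag so a failed match yields an empty string instead of the input.
GCOMM_STRICT=$(echo "$ARGS" | sed -nE 's#.*gcomm://([0-9.,]+)([?[:space:]].*)?$#\1#p')

# The first word of GCOMM is what the node loop would try to contact first.
echo "original: ${GCOMM%% *}"
echo "strict:   $GCOMM_STRICT"
```

With the input above, the first line shows --console and the second shows 10.0.0.173,10.0.0.68. An empty GCOMM_STRICT would make a parse failure easy to detect before any socat call is attempted.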

mayask commented 7 years ago

My bad. I had edited the script, and it was no longer working.

colinmollenhour / mariadb-galera-swarm

socat communication with other nodes is taking forever #10