Closed mayask closed 7 years ago
Hi
I'm testing out the ability of Galera cluster to recover a data node.
The node recovery script hangs when it starts collecting info from other nodes. Here is the TRACE output:
mysqld.sh: Attempting to recover GTID positon...
++ mktemp -t wsrep_recover.XXXXXX
+ tmpfile=/tmp/wsrep_recover.onIjcq
+ mysqld --wsrep-on=ON --wsrep_sst_method=skip --wsrep_cluster_address=gcomm:// --skip-networking --wsrep-recover
+ '[' 0 -ne 0 ']'
+ grep WSREP /tmp/wsrep_recover.onIjcq
2017-01-23 18:18:30 140129296246720 [Note] WSREP: Recovered position: 9393f3f9-e18d-11e6-916a-8bd9bb399557:195
+ echo 'mysqld.sh: --------------------------------------------------'
mysqld.sh: --------------------------------------------------
++ sed -n 's/.*WSREP: Recovered position:\s*//p' /tmp/wsrep_recover.onIjcq
+ POSITION=9393f3f9-e18d-11e6-916a-8bd9bb399557:195
+ rm -f /tmp/wsrep_recover.onIjcq
++ sed -E 's#.*--wsrep_node_address=([0-9\.]+):4567.*#\1#'
+ NODE_ADDRESS=10.1.58.4
++ sed -E 's#.*gcomm://([0-9\.,]+)\s+.*#\1#'
+ GCOMM='--console --wsrep-on=ON --wsrep_cluster_name=dbcluster --wsrep_cluster_address=gcomm://10.0.0.173,10.0.0.68?pc.wait_prim=no --wsrep_node_address=10.1.58.4:4567 --wsrep-gtid-mode=1 --wsrep-gtid-domain-id=1 --wsrep_sst_auth=xtrabackup:123 --default-time-zone=+00:00'
+ [[ -f /var/lib/mysql/wsrep-new-cluster ]]
+ [[ -z 9393f3f9-e18d-11e6-916a-8bd9bb399557:195 ]]
+ check_nodes --console --wsrep-on=ON --wsrep_cluster_name=dbcluster '--wsrep_cluster_address=gcomm://10.0.0.173,10.0.0.68?pc.wait_prim=no' --wsrep_node_address=10.1.58.4:4567 --wsrep-gtid-mode=1 --wsrep-gtid-domain-id=1 --wsrep_sst_auth=xtrabackup:123 --default-time-zone=+00:00 10.1.58.4
+ for node in '${1//,/ }'
+ '[' --console = --wsrep-on=ON ']'
+ curl -f -s -o - http://--console:8081
+ return 1
+ LISTEN_PORT=3309
+ EXPECT_NODES=3
+ [[ -f /var/lib/mysql/gvwstate.dat ]]
++ awk '/^view_id:/{print $2 " " $3 " " $4}'
+ VIEW_ID='3 d9304629-e196-11e6-8069-5b4e96a363da 4'
++ grep '^member:'
++ wc -l
+ GVW_MEMBERS=2
+ echo 'mysqld.sh: Found view from gvwstate.dat with (2) members: 3 d9304629-e196-11e6-8069-5b4e96a363da 4'
mysqld.sh: Found view from gvwstate.dat with (2) members: 3 d9304629-e196-11e6-8069-5b4e96a363da 4
+ [[ 2 -gt 3 ]]
++ wc -w
+ [[ 10 -gt 3 ]]
++ wc -w
+ EXPECT_NODES=10
+ EXPECT_NODES=9
+ echo 'mysqld.sh: Collecting grastate.dat and gvwstate.dat info from other nodes...'
mysqld.sh: Collecting grastate.dat and gvwstate.dat info from other nodes...
+ set -m
++ mktemp -t socat.XXXX
+ tmpfile=/tmp/socat.ba40
+ PID_SERVER=94
+ SENT_NODES=
+ for i in '{36..0}'
+ for node in '${GCOMM//,/ }'
+ [[ --console = 10.1.58.4 ]]
+ socat - TCP:--console:3309
+ socat -u TCP-LISTEN:3309,bind=10.1.58.4,fork OPEN:/tmp/socat.ba40,append
The exact command in question can be seen here:
root@galera-node-1:/# ps auxw
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 20124 20 ? Ss 18:18 0:00 /bin/bash /usr/local/bin/start.sh node galera-seed,galera-node
root 35 0.0 0.0 7480 940 ? Sl 18:18 0:00 galera-healthcheck -user=system -password=181210f8f9c779c26da1d9b20
root 36 0.0 0.0 7480 552 ? Sl 18:18 0:00 galera-healthcheck -user=system -password=181210f8f9c779c26da1d9b20
mysql 39 0.0 0.0 20180 308 ? S 18:18 0:00 /bin/bash /usr/local/bin/mysqld.sh --console --wsrep-on=ON --wsrep_
mysql 94 0.0 0.0 19636 352 ? S 18:18 0:00 socat -u TCP-LISTEN:3309,bind=10.1.58.4,fork OPEN:/tmp/socat.ba40,a
mysql 95 74.1 0.0 25944 400 ? R 18:18 29:29 socat - TCP:--console:3309
root 349 0.0 0.1 20252 1108 ? Ss 18:40 0:00 bash
root 565 0.0 0.1 17492 1144 ? R+ 18:58 0:00 ps auxw
Could you please tell me if it's due to default timeout settings being too big?
If so, what would be an appropriate timeout to set here? I can submit a PR for this.
Thanks a lot
Looks like socat - TCP:--console:3309 is not correct.
My bad. I had edited the script, and it was no longer working.
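For anyone hitting the same symptom: in the trace, GCOMM ends up holding the whole argument string rather than a host list. That suggests the `sed` expression did not match (it requires whitespace right after the address list, but `?pc.wait_prim=no` follows it here), so `sed` passed its input through unchanged and the first "node" became `--console`. Below is a minimal sketch of a more defensive extraction. The helper name `extract_gcomm_hosts` and the error message are hypothetical and not part of the original mysqld.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original script): pull the comma-separated
# host list out of "gcomm://..." and tolerate a trailing "?option=value" suffix.
extract_gcomm_hosts() {
    local args="$1"
    # With -n and the trailing p flag, sed prints only when the pattern matches,
    # so a non-matching input yields an empty string instead of the raw arguments.
    printf '%s\n' "$args" | sed -n -E 's#.*gcomm://([0-9.,]+).*#\1#p'
}

# Shortened copy of the arguments seen in the trace.
ARGS='--console --wsrep-on=ON --wsrep_cluster_address=gcomm://10.0.0.173,10.0.0.68?pc.wait_prim=no --wsrep_node_address=10.1.58.4:4567'
GCOMM="$(extract_gcomm_hosts "$ARGS")"

# Fail fast instead of handing an option string to socat or curl as a hostname.
if [[ -z "$GCOMM" ]]; then
    echo "mysqld.sh: could not parse cluster address" >&2
    exit 1
fi

# Split on commas, as the original loop does.
for node in ${GCOMM//,/ }; do
    echo "would contact: $node"
done
```

The point of the sketch is that an unparseable address list produces an empty value that can be checked before the collection loop starts, rather than a bogus host that `socat` keeps trying to reach.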