codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

garbd sst backup using xtrabackup-v2 close socket before the backup is being sent #397

Open · menardorama opened this issue 8 years ago

menardorama commented 8 years ago

Hi,

I'm trying to implement a centralized backup from garbd using the wsrep backup mechanism (xtrabackup-v2).

From what I see in the garbd output and in the error.log on my Galera node, garbd successfully requests an SST backup and the node begins the SST.

I define the garbd command line as follows:

```
/usr/bin/garbd -a gcomm://mysql3:4567?gmcast.listen_addr=tcp://0.0.0.0:4444 --group My_Cluster --donor mysql3 -n $(hostname) --sst 'xtrabackup-v2:192.168.0.1:4444:1:1:10'
```

and this is the garbd output:

```
2016-04-07 11:17:15.179 INFO: CRC-32C: using hardware acceleration.
2016-04-07 11:17:15.180 INFO: Read config:
  daemon:  0
  name:    mysql-garbd
  address: gcomm://my_galera_node:4567?gmcast.listen_addr=tcp://0.0.0.0:4444
  group:   My_Cluster
  sst:     xtrabackup-v2:192.168.0.1:4444:1:1:10
  donor:   my_galera_node
  options: gcs.fc_limit=9999999; gcs.fc_factor=1.0; gcs.fc_master_slave=yes
  cfg:
  log:
2016-04-07 11:17:15.182 INFO: protonet asio version 0
2016-04-07 11:17:15.183 INFO: Using CRC-32C for message checksums.
2016-04-07 11:17:15.183 INFO: backend: asio
2016-04-07 11:17:15.184 WARN: access file(./gvwstate.dat) failed(No such file or directory)
2016-04-07 11:17:15.184 INFO: restore pc from disk failed
2016-04-07 11:17:15.185 INFO: GMCast version 0
2016-04-07 11:17:15.187 INFO: (84c0f869, 'tcp://0.0.0.0:4444') listening at tcp://0.0.0.0:4444
2016-04-07 11:17:15.187 INFO: (84c0f869, 'tcp://0.0.0.0:4444') multicast: , ttl: 1
2016-04-07 11:17:15.188 INFO: EVS version 0
2016-04-07 11:17:15.189 INFO: gcomm: connecting to group 'Pasteur_Cluster', peer 'mysql3.it.pasteur.fr:4567'
2016-04-07 11:17:15.191 INFO: (84c0f869, 'tcp://0.0.0.0:4444') turning message relay requesting on, nonlive peers: tcp://192.168.0.2:4567 tcp://192.168.0.2:4567 tcp://192.168.0.1:4567
2016-04-07 11:17:15.248 INFO: declaring 49330e7e at tcp://192.168.0.1:4567 stable
2016-04-07 11:17:15.248 INFO: declaring 584ed1f8 at tcp://192.168.0.2:4567 stable
2016-04-07 11:17:15.248 INFO: declaring 64598718 at tcp://192.168.0.3:4567 stable
2016-04-07 11:17:15.248 INFO: declaring 716c1463 at tcp://192.168.0.4:4567 stable
2016-04-07 11:17:15.249 INFO: Node 49330e7e state prim
2016-04-07 11:17:15.250 INFO: view(view_id(PRIM,49330e7e,446) memb { 49330e7e,0 584ed1f8,0 64598718,0 716c1463,0 84c0f869,0 } joined { } left { } partitioned { })
2016-04-07 11:17:15.250 INFO: save pc into disk
2016-04-07 11:17:15.690 INFO: gcomm: connected
2016-04-07 11:17:15.690 INFO: Changing maximum packet size to 64500, resulting msg size: 32636
2016-04-07 11:17:15.690 INFO: Shifting CLOSED -> OPEN (TO: 0)
2016-04-07 11:17:15.690 INFO: Opened channel 'My_Cluster'
2016-04-07 11:17:15.691 INFO: New COMPONENT: primary = yes, bootstrap = no, my_idx = 4, memb_num = 5
2016-04-07 11:17:15.691 INFO: STATE EXCHANGE: Waiting for state UUID.
2016-04-07 11:17:15.691 INFO: STATE EXCHANGE: sent state msg: 84cb38cf-fca1-11e5-975f-6bd1bb09e1eb
2016-04-07 11:17:15.691 INFO: STATE EXCHANGE: got state msg: 84cb38cf-fca1-11e5-975f-6bd1bb09e1eb from 0 (garb)
2016-04-07 11:17:15.691 INFO: STATE EXCHANGE: got state msg: 84cb38cf-fca1-11e5-975f-6bd1bb09e1eb from 1 (mysql1)
2016-04-07 11:17:15.691 INFO: STATE EXCHANGE: got state msg: 84cb38cf-fca1-11e5-975f-6bd1bb09e1eb from 2 (mysql2)
2016-04-07 11:17:15.691 INFO: STATE EXCHANGE: got state msg: 84cb38cf-fca1-11e5-975f-6bd1bb09e1eb from 3 (mysql3)
2016-04-07 11:17:15.692 INFO: STATE EXCHANGE: got state msg: 84cb38cf-fca1-11e5-975f-6bd1bb09e1eb from 4 (mysql-garbd)
2016-04-07 11:17:15.692 INFO: Quorum results: version = 3, component = PRIMARY, conf_id = 439, members = 4/5 (joined/total), act_id = 161946, last_appl. = -1, protocols = 0/7/3 (gcs/repl/appl), group UUID = 3c0da4ec-f809-11e5-896b-06e1ccff7168
2016-04-07 11:17:15.692 INFO: Flow-control interval: [9999999, 9999999]
2016-04-07 11:17:15.692 INFO: Shifting OPEN -> PRIMARY (TO: 161946)
2016-04-07 11:17:15.692 INFO: Sending state transfer request: 'xtrabackup-v2:192.168.0.1:4444:1:1:10', size: 40
2016-04-07 11:17:15.693 INFO: Member 4.0 (mysql-garbd) requested state transfer from 'mysql3'. Selected 3.0 (mysql3)(SYNCED) as donor.
2016-04-07 11:17:15.693 INFO: Shifting PRIMARY -> JOINER (TO: 161946)
2016-04-07 11:17:15.694 INFO: Closing send monitor...
2016-04-07 11:17:15.694 INFO: Closed send monitor.
2016-04-07 11:17:15.694 INFO: gcomm: terminating thread
2016-04-07 11:17:15.694 INFO: gcomm: joining thread
2016-04-07 11:17:15.694 INFO: gcomm: closing backend
2016-04-07 11:17:15.696 INFO: view(view_id(NON_PRIM,49330e7e,446) memb { 84c0f869,0 } joined { } left { } partitioned { 49330e7e,0 584ed1f8,0 64598718,0 716c1463,0 })
2016-04-07 11:17:15.696 INFO: view((empty))
2016-04-07 11:17:15.697 INFO: 4.0 (mysql-garbd): State transfer from 3.0 (mysql3) complete.
2016-04-07 11:17:15.697 INFO: Shifting JOINER -> JOINED (TO: 161946)
2016-04-07 11:17:15.697 INFO: gcomm: closed
2016-04-07 11:17:15.698 WARN: 0x168e180 down context(s) not set
2016-04-07 11:17:15.698 WARN: Failed to send SYNC signal: -107 (Transport endpoint is not connected)
2016-04-07 11:17:15.698 INFO: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2016-04-07 11:17:15.698 INFO: Flow-control interval: [9999999, 9999999]
2016-04-07 11:17:15.698 INFO: Received NON-PRIMARY.
2016-04-07 11:17:15.698 INFO: Shifting JOINED -> OPEN (TO: 161946)
2016-04-07 11:17:15.698 INFO: Received self-leave message.
2016-04-07 11:17:15.698 INFO: Flow-control interval: [9999999, 9999999]
2016-04-07 11:17:15.698 INFO: Received SELF-LEAVE. Closing connection.
2016-04-07 11:17:15.698 INFO: Shifting OPEN -> CLOSED (TO: 161946)
2016-04-07 11:17:15.698 INFO: RECV thread exiting 0: Success
2016-04-07 11:17:15.698 INFO: recv_thread() joined.
2016-04-07 11:17:15.698 INFO: Closing replication queue.
2016-04-07 11:17:15.698 INFO: Closing slave action queue.
2016-04-07 11:17:15.698 WARN: Attempt to close a closed connection
2016-04-07 11:17:15.698 INFO: Exiting main loop
2016-04-07 11:17:15.698 INFO: Shifting CLOSED -> DESTROYED (TO: 161946)
```

From what I can see, the backup is starting, but the garbd process exits without waiting for the backup to be transferred.

On the donor side, the backup is done, but the transfer fails because the socket is no longer listening.

Am I right? Is there an option I can pass so that garbd keeps listening?

Thanks

utdrmac commented 8 years ago

garbd does not contain a complete set of your data. garbd only acts as a mediator/go-between/voting member/etc. It only has a gcache, not actual data. Full data sets only exist on a running MySQL node, not a garbd node.

menardorama commented 8 years ago

Hi

Thanks for the reply, but I succeeded in implementing the backup using garbd.

I had to run garbd like this: `garbd -a gcomm://NODE_NAME:NODE_PORT?gmcast.listen_addr=tcp://0.0.0.0:$LISTEN_PORT --group $CLUSTER_NAME --donor $NODE_NAME -n $HOSTNAME --sst "xtrabackup-v2:my_server::1:1:10"`

And in the same script I run two socat processes sequentially:

`socat -u TCP-LISTEN:$LISTEN_PORT,reuseaddr stdio > xtrabackup_galera_info`
`socat -u TCP-LISTEN:$LISTEN_PORT,reuseaddr stdio > xtrabackup`
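
Putting those pieces together, a minimal sketch of such a wrapper could look like the following (the variable values, the address of the backup host, and the output file names are placeholder assumptions, not my exact script; the trailing fields of the SST string are copied verbatim from the commands above):

```bash
#!/bin/bash
# Assumed placeholder values -- adjust for the real cluster.
NODE_NAME=mysql3          # donor node
NODE_PORT=4567            # donor's Galera port
CLUSTER_NAME=My_Cluster
LISTEN_PORT=4444          # port the donor streams the SST back to
MY_ADDR=192.168.0.1       # this backup host, reachable from the donor

# Ask the cluster for an SST; as the logs above show, garbd exits
# almost immediately after the donor accepts the request.
garbd -a "gcomm://${NODE_NAME}:${NODE_PORT}?gmcast.listen_addr=tcp://0.0.0.0:${LISTEN_PORT}" \
      --group "${CLUSTER_NAME}" --donor "${NODE_NAME}" -n "$(hostname)" \
      --sst "xtrabackup-v2:${MY_ADDR}:${LISTEN_PORT}:1:1:10"

# The donor then opens two connections in turn: first the galera info
# file, then the backup stream itself (its format depends on the donor's
# SST configuration, e.g. xbstream or tar).
socat -u TCP-LISTEN:${LISTEN_PORT},reuseaddr stdio > xtrabackup_galera_info
socat -u TCP-LISTEN:${LISTEN_PORT},reuseaddr stdio > xtrabackup
```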

And it's OK, but I know it's not perfect... at least for a one-time backup. I'll have to come back to it to get binlog archiving working, which is just a shame on mysql/mariadb.

utdrmac commented 8 years ago

If you are just looking for a remote backup, why not skip garbd altogether and use xtrabackup by itself? You still get all of the information from the cluster and you don't have to bother with garbd joining/dejoining over and over. What do you feel you are gaining by using garbd in this manner?

menardorama commented 8 years ago

You can have a central backup server with just garbd installed but not started.

It avoids having an NFS share on the Galera nodes to do backups.

To be honest, I don't like xtrabackup, mostly because of the incremental implementation.

It should have been designed for remote backups, but that's not the case (or the documentation is too poor to be understood); that's why I use garbd.

And with this approach, if you have a correctly sized gcache, you can add a new node without having to do an SST: you only restore the backup and the node catches up from the gcache.
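
As a rough sketch of what I mean (the UUID/seqno below are just the example values from the logs above, and the exact grastate.dat field layout may differ between Galera versions, so treat this as an assumption rather than a tested recipe):

```bash
# After restoring the prepared backup into the new node's datadir,
# read the cluster position recorded during the backup/SST:
cat /var/lib/mysql/xtrabackup_galera_info
# e.g. 3c0da4ec-f809-11e5-896b-06e1ccff7168:161946

# Write a matching grastate.dat so the node requests an IST from that
# position instead of a full SST (example values):
cat > /var/lib/mysql/grastate.dat <<EOF
# GALERA saved state
version: 2.1
uuid:    3c0da4ec-f809-11e5-896b-06e1ccff7168
seqno:   161946
EOF
chown mysql:mysql /var/lib/mysql/grastate.dat

# Then start mysqld; if the donor's gcache still covers everything
# after seqno 161946, the node joins via IST.
```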

utdrmac commented 8 years ago

Well, you don't need NFS on the Galera nodes just for backups. You are already running xtrabackup, wrapped by garbd and wrapped again by your script. Too many wrappers. I've implemented several "centralized backups" before; it's much simpler than this, and no garbd involvement is necessary. One script on the backup server ssh's to each host that you want to back up and executes a streaming xtrabackup back to the central host. It's a one-liner: no garbd, no NFS on each node, no wrappers within wrappers. Your solution, to me, seems like huge overkill while accomplishing the same thing.
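
For example, a minimal sketch of that kind of one-liner, run from the central backup host (the user, host names, credential handling, and paths are assumptions for illustration, not a tested recipe):

```bash
# Loop over the nodes to back up and stream xtrabackup output back over
# ssh; assumes xtrabackup on each node can read MySQL credentials
# (e.g. from /root/.my.cnf).
for node in mysql1 mysql2 mysql3; do
  ssh root@"${node}" \
    "xtrabackup --backup --galera-info --stream=xbstream --target-dir=/tmp" \
    > "/backups/${node}-$(date +%F).xbstream"
done
```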

menardorama commented 8 years ago

Feel free to provide some examples; I would be really happy.

It's just that this is the solution proposed at http://galeracluster.com/documentation-webpages/backingupthecluster.html

It seems that nobody in the MariaDB/MySQL world shares how databases are backed up (I mean no mysqldump, of course).

utdrmac commented 8 years ago

Plenty of examples here: https://www.percona.com/doc/percona-xtrabackup/2.4/howtos/recipes_ibkx_stream.html. Just be sure to add --galera-info to your invocation to get the Galera info, along with other options like --compress, --parallel, --stream=xbstream, etc. These two blog posts may also be of use to you: https://www.percona.com/blog/2015/07/16/bypassing-sst-pxc-incremental-backups/ and https://www.percona.com/blog/2012/08/02/avoiding-sst-when-adding-new-percona-xtradb-cluster-node/
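
To make that concrete, one possible shape of the invocation with those options, pulled back to the backup host and unpacked with xbstream (the host name and directories are assumptions for illustration):

```bash
# Stream a compressed, parallel backup from one node and unpack it
# locally on the backup server.
ssh root@mysql1 \
  "xtrabackup --backup --galera-info --compress --parallel=4 --stream=xbstream --target-dir=/tmp" \
  | xbstream -x -C /backups/mysql1
```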