colinmollenhour / mariadb-galera-swarm

MariaDB Galera Cluster container based on the official mariadb image, which can auto-bootstrap and recover cluster state.
https://hub.docker.com/r/colinmollenhour/mariadb-galera-swarm
Apache License 2.0

Cluster fails to start #36

Closed: cpjolly closed this issue 6 years ago

cpjolly commented 6 years ago

I have a Galera Cluster that has been running without problems as part of a Docker Stack on a 4-server Docker Swarm in production for the past 8 months.

This week, for unrelated reasons, I had to restart the Docker Stack from scratch, and the Galera Cluster failed to start.

To test, I created a brand new three-host Docker swarm and have confirmed that neither the "Docker 1.12 Swarm Mode (cli)" example nor the "Docker 1.13 Swarm Mode (stack)" example works as expected.

No matter what changes I make, I can find no way to successfully start a 3-node cluster.

For example, using the "Docker 1.13 Swarm Mode (stack)" example, with the current "colinmollenhour/mariadb-galera-swarm" (10.2) image, the galera_seed service starts fine, but the two galera_nodes just endlessly try to start, fail, then restart... See attached log file. Sometimes I can get the two galera_nodes to start and sync, but then they fail when I stop the seed.

t.log

colinmollenhour commented 6 years ago

The "Failed to read 'ready <addr>' from: ..." error is the same one I got on 10.1.31 before applying the patches, so I think this might be a bug in the xtrabackup-v2 shell script. I'm a little surprised it is not fixed yet in 10.2 if that's the case. (I say "a little" only because my expectations for MariaDB stability and speed of releasing critical fixes have been shattered this year.)

I'd go back to 10.1 for now. I haven't used 10.2 myself yet; I only released it because others said it worked and were requesting it. The official 10.1.31 has completely broken SST with xtrabackup-v2, but this repo contains two patches that fix it. Amazing that they have not bothered to release a 10.1.32 yet to fix this themselves.

If you really want to use 10.2 I'd first verify that the xtrabackup-v2 script in the image contains this patch: https://github.com/colinmollenhour/mariadb-galera-swarm/blob/master/mdev-15254.patch
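One rough way to run that check is sketched below in Python. It assumes you have already copied the xtrabackup-v2 script out of the image and downloaded the patch as local text files. The function names `added_lines` and `looks_patched` are made up for this illustration, and the check is only a heuristic: it confirms that every line the patch adds appears somewhere in the script, not that the lines sit in the right place.

```python
def added_lines(patch_text):
    """Return the non-blank lines a unified diff adds, with the leading '+' removed."""
    lines = []
    for line in patch_text.splitlines():
        # Skip the '+++ b/file' header line; keep real additions only.
        if line.startswith("+") and not line.startswith("+++"):
            content = line[1:].strip()
            if content:
                lines.append(content)
    return lines


def looks_patched(script_text, patch_text):
    """Heuristic: True if every line added by the patch is present in the script."""
    script_lines = {line.strip() for line in script_text.splitlines()}
    wanted = added_lines(patch_text)
    # A patch with no added lines cannot be confirmed this way, so report False.
    return bool(wanted) and all(line in script_lines for line in wanted)
```

To use it, read both files into strings and call `looks_patched(script_text, patch_text)`. A `False` result suggests the image's script does not include the fix, while a `True` result is only a hint and is worth confirming by reading the script.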

cpjolly commented 6 years ago

Hi @colinmollenhour

I ran some more tests with 10.1, but it had similar issues and never stabilised, so this problem is not limited to 10.2.

Interesting to hear about your concerns with MariaDB stability etc. In your opinion, is something like Percona XtraDB Cluster a better alternative? As far as I can tell, they suffer from similar issues around cluster stability.

Many thanks

colinmollenhour commented 6 years ago

The patches are already applied by Dockerfile-10.1, so if you use the 10.1 tags there is nothing to do.

I haven't looked into Percona enough to see if it is any better, but it would probably be the next one I would try. That, or MySQL + Galera plugin.