colinmollenhour / mariadb-galera-swarm

MariaDb Galera Cluster container based on official mariadb image which can auto-bootstrap and recover cluster state.
https://hub.docker.com/r/colinmollenhour/mariadb-galera-swarm
Apache License 2.0

Fail to start in Rancher1.6-Swarm1.13 #68

Closed zqcthegreat closed 5 years ago

zqcthegreat commented 5 years ago

Seed runs, but the node does not. In Rancher 1.6, I added a Stack named 'SQL' and then added a service named 'galera-seed', configured following the README, and it works. Then I added a service named 'galera-node', and something goes wrong.

The log in 'galera-node' shows this:

[Warning] WSREP: Could not open state file for reading: '/var/lib/mysql//grastate.dat'
[Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
[Warning] WSREP: Gap in state sequence. Need state transfer.
[Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (9e6e4838-4104-11e9-9c61-6a606ae7d65a): 1 (Operation not permitted)
[Warning] WSREP: Member 0.0 (SQL-galera-node-1) requested state transfer from 'any', but it is impossible to select State Transfer donor: Resource temporarily unavailable
[Note] WSREP: (aa73f9d6, 'tcp://0.0.0.0:4567') connection to peer aa73f9d6 with addr tcp://10.42.80.180:4567 timed out, no messages seen in PT3S
[Warning] WSREP: 2.0 (SQL-galera-seed-1): State transfer to 1.0 (SQL-galera-node-2) failed: -22 (Invalid argument)
[ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():737: Will never receive state. Need to abort.
/usr/local/bin/start.sh: line 416: 41 Aborted gosu mysql mysqld.sh --console $MYSQL_MODE_ARGS --wsrep_cluster_name=$CLUSTER_NAME --wsrep_cluster_address=gcomm://$GCOMM --wsrep_node_address=$NODE_ADDRESS:4567 --default-time-zone=$DEFAULT_TIME_ZONE "$@" 2>&1
MariaDB exited with return code (0)

After removing 'seed' and adding 'node', the cluster fails to run. One of the three nodes shows "Access denied for user 'system'@'127.0.0.1' (using password: YES)", another shows "socat[611] E connect(7, AF=2 10.42.177.98:3309, 16): Connection refused", and the other shows "MariaDB exited with return code (0)", endlessly.

Could you help me debug this? Grateful for that!

colinmollenhour commented 5 years ago

Hi, I don't offer user support on GitHub; the Issues are only for reporting bugs or proposing improvements. In this case it appears there is a networking issue:

[Note] WSREP: (aa73f9d6, 'tcp://0.0.0.0:4567') connection to peer aa73f9d6 with addr tcp://10.42.80.180:4567 timed out, no messages seen in PT3S

It could also be a configuration issue if 10.42.80.180 is not the correct interface for your cluster to communicate on. Make sure the two containers can communicate over the required ports for standard Galera operation, in addition to 3309 for the cluster recovery chatter.
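A quick way to rule networking in or out is to check, from inside one container, whether the peer's ports are reachable. A minimal sketch, assuming bash and coreutils are available in the container and using the peer address from the log above (the standard Galera ports are 3306 for MySQL clients, 4567 for group replication, 4568 for IST, and 4444 for SST, plus 3309 used by this image for recovery chatter):

```
# Minimal connectivity check; PEER is the other container's address (illustrative).
PEER=10.42.80.180
for PORT in 3306 4567 4568 4444 3309; do
  if timeout 2 bash -c "cat < /dev/null > /dev/tcp/$PEER/$PORT"; then
    echo "port $PORT reachable"
  else
    echo "port $PORT NOT reachable"
  fi
done
```

Any port that shows as not reachable while the peer is running points at the overlay/bridge network or firewall rather than at MariaDB itself.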

GimpMaster commented 4 years ago

Sorry to bring up this closed issue. However, I'm seeing the same error. Why would the "cluster recovery chatter" talk across port 3309? I don't see 3309 listed on any of Galera's firewall port lists to expose.

To be honest, I don't even see it in the EXPOSE instruction of your Dockerfile. But I'm seeing the exact same issue when doing manual docker runs (not swarm).

colinmollenhour commented 4 years ago

It's not Galera, it is specific to this Docker image: https://github.com/colinmollenhour/mariadb-galera-swarm/search?q=3309&unscoped_q=3309

Using "EXPOSE" is not required, it is just a hint:

The EXPOSE instruction does not actually publish the port. It functions as a type of documentation between the person who builds the image and the person who runs the container, about which ports are intended to be published.

The port should not be exposed to the outside world, but the containers do need to be able to communicate with each other over this port.
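For manual (non-Swarm) runs, one way to satisfy this is to put the containers on a shared user-defined network so they can reach each other on 3309 and the Galera ports without publishing anything on the host. A rough sketch, assuming the image's seed/node commands and XTRABACKUP_PASSWORD variable as described in the README; container names and the password are placeholders:

```
docker network create galera-net

# First container bootstraps the cluster (assumed "seed" command from the README).
docker run -d --name galera-seed --network galera-net \
  -e XTRABACKUP_PASSWORD=changeme \
  colinmollenhour/mariadb-galera-swarm seed

# Additional nodes join by container name; no ports need to be published on the
# host for the containers to reach each other over this network.
docker run -d --name galera-node1 --network galera-net \
  -e XTRABACKUP_PASSWORD=changeme \
  colinmollenhour/mariadb-galera-swarm node galera-seed
```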

GimpMaster commented 4 years ago

Thank you very much for the insight. I've got it working now. The issue for me specifically was that NODE_ADDRESS was taken from eth0, which was on a bridge network. So when it bound to that address for socat, it used the Docker bridge and would never accept any incoming packets on the host's eth0.
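For anyone hitting the same thing: the image lets you override which address it binds to via the NODE_ADDRESS environment variable (per the README it accepts an IP address or an interface name). A hedged sketch of the kind of fix described above, using host networking so eth0 inside the container is the host's eth0; all names, addresses, and the password are illustrative:

```
# Illustrative only: bind to the host's real eth0 address instead of the Docker
# bridge so peers can reach socat (3309) and Galera on this node.
docker run -d --name galera-node1 --network host \
  -e NODE_ADDRESS=eth0 \
  -e XTRABACKUP_PASSWORD=changeme \
  colinmollenhour/mariadb-galera-swarm node 192.168.1.10,192.168.1.11,192.168.1.12
```

The comma-separated list after "node" is the set of host addresses of the cluster members in this hypothetical three-node setup.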

I'm still a little new to Docker so I wanted to prove I could get these running manually on 3 nodes. I'll try Docker Swarm next.

By the way....thank you for the great container!!!