colinmollenhour / mariadb-galera-swarm

MariaDb Galera Cluster container based on official mariadb image which can auto-bootstrap and recover cluster state.
https://hub.docker.com/r/colinmollenhour/mariadb-galera-swarm
Apache License 2.0
217 stars 102 forks

Docker stack fails to start nodes if stack name not galera #20

Closed cpjolly closed 7 years ago

cpjolly commented 7 years ago

If I edit docker-compose.yml as follows:

services:
  seed:
    ...
  node:
    ...
    command: node test_seed,test_node

and then do docker stack deploy -c docker-compose.yml test, the seed starts OK, but when I start the nodes by scaling them to 2, they fail.

First there are multiple:

[Warning] WSREP: Member 0.0 (xxxxx) requested state transfer from 'any', but it is impossible to select State Transfer donor: Resource temporarily unavailable
...

And finally:

WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid stage

If I change back to galera, restart the stack and scale the nodes to 2, I also see a few

[Warning] WSREP: Member 0.0 (xxxxx) requested state transfer from 'any', but it is impossible to select State Transfer donor: Resource temporarily unavailable

But then it seems to work.

Is galera defined as a fallback or a variable somewhere?

colinmollenhour commented 7 years ago

No, I have no idea why your first example would not work the same as the one in the repo... If it gets all the way to the WSREP part in the logs then it is already past DNS resolution so I doubt that is the real issue.

cpjolly commented 7 years ago

I think I understand the problem and am looking at how to create a fix.

The issue is that start.sh assumes the seed and the node containers are connected to the stack default overlay network ("stackname_default", e.g. "galera_default" or "test_default") via the eth0 interface (see lines 65 and 68).

If the stack name is "galera", this is true, BUT if the stack name is "test", this is NOT true. In that case, the default overlay network is connected to the eth2 interface on the seed and node containers.

I can see this is the case because the container log file for the seed says: Got NODE_ADDRESS=10.255.0.6

10.255.0.6 is an IP address on the swarm "ingress" overlay network, not the default overlay "test_default" network.

If the name of the stack is "test", then the ingress network is connected to eth0, not the default overlay network.

Presumably Docker assigns overlay networks to interfaces alphabetically, so "ingress" comes after "galera_default" but before "test_default", which means if the stack is called "galera", then "galera_default" is assigned to eth0, but if the stack is called "test" then "ingress" is assigned to eth0.
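To make the guess above concrete, here is a small sketch that sorts the network names mentioned in this thread in byte order. It only illustrates the presumed alphabetical ordering; it is not taken from Docker's documentation or source, and first_network is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the name that sorts first in byte order.
# Assumption: this mirrors the ordering guessed at in this comment, nothing more.
first_network() {
    printf '%s\n' "$@" | LC_ALL=C sort | head -n 1
}

first_network ingress galera_default   # stack named "galera"
first_network ingress test_default     # stack named "test"
```

Under that assumption, the first call prints galera_default and the second prints ingress, which would match the eth0 behaviour seen in the logs.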

Probably the resolution is to add another overlay network in the docker-compose file with a specific subnet address, and to search for interfaces connected to that specific subnet in start.sh rather than relying on eth0.
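A rough sketch of what "search by subnet instead of eth0" could look like. This is not the code in start.sh; pick_address is a hypothetical helper, the 10.0.0.0/24 subnet is an assumed example, and the sample lines only imitate the one-line-per-address output of ip -o -4 addr show.

```shell
#!/bin/sh
# Hypothetical helper, not part of start.sh: print the first IPv4 address
# that matches the given grep pattern.
# Input on stdin: lines shaped like `ip -o -4 addr show` output, where the
# fourth field is ADDRESS/PREFIXLEN.
pick_address() {
    pattern="$1"
    awk '{ split($4, a, "/"); print a[1] }' | grep -e "$pattern" | head -n 1
}

# Made-up sample for a stack named "test": eth0 is on ingress, eth2 is on
# the stack network (the 10.0.0.0/24 subnet is an assumption).
sample='2: eth0    inet 10.255.0.6/16 scope global eth0
3: eth1    inet 172.18.0.3/16 scope global eth1
4: eth2    inet 10.0.0.5/24 scope global eth2'

printf '%s\n' "$sample" | pick_address '^10\.0\.0\.'
```

With the sample above this prints 10.0.0.5 regardless of which ethN the stack network ended up on. Escaping the dots keeps the pattern from also matching addresses such as 10.10.0.5.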

bgou commented 7 years ago

I ran into a similar issue not long ago, and it was fixed by making sure both are on a user-created network.

Note: Service discovery will only work if your services are attached to a user-created overlay network (see top of this article). When a swarm is initialized, an ingress network is created if it does not exist. This network is not used by containers directly, but to enable the routing mesh functionality in swarm mode.

See https://docs.docker.com/engine/swarm/networking/

cpjolly commented 7 years ago

Hi @bgou - I don't think that's the issue here.

When you use a docker-compose stack, it auto-generates a default overlay network called "stackname_default" and attaches all containers in the stack to that network, so there is normally no need to add a user-defined overlay network.

The containers and services in the stack have the correct network connectivity and DNS discoverability. The problem is that the start.sh script makes an assumption that the stack default overlay network is connected to eth0, which is not true if the stack is called "test".

colinmollenhour commented 7 years ago

Probably the resolution is to add another overlay network in the docker-compose file with a specific subnet address, and to search for interfaces connected to that specific subnet in start.sh rather than eth0

That sounds like a good solution and it should already be supported, since NODE_ADDRESS can be a grep -e compatible regex. This isn't tested, though.

cpjolly commented 7 years ago

For the grep -e with, for example, "^10.0.0.*" to work, we need to make a minor change to start.sh.

On line 71, getent hosts $(hostname) returns the IP address followed by the hostname.

We only want the IP address, so we need to add " | awk '{print $1}' " to line 71.

I'm sure there is a clever awk expression that removes the need for the grep, but this additional awk is the simplest change.
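For illustration, here is the difference on a simulated getent hosts line. The address and hostname are made up, and this is a sketch of the idea rather than the actual line 71 of start.sh. The last variant shows one possible awk-only form that avoids the separate grep.

```shell
#!/bin/sh
# Simulated `getent hosts $(hostname)` output: address, whitespace, hostname.
# Both values are invented for this example.
getent_line='10.0.0.5        3f2a9c1b7d4e'

# Without awk, grep passes the whole line through, hostname included.
without_awk=$(printf '%s\n' "$getent_line" | grep -e '^10\.0\.0\.')

# With the extra awk, only the first field (the address) is left.
with_awk=$(printf '%s\n' "$getent_line" | awk '{ print $1 }' | grep -e '^10\.0\.0\.')

# One possible awk-only variant: match on the first field and print it.
awk_only=$(printf '%s\n' "$getent_line" | awk '$1 ~ /^10\.0\.0\./ { print $1; exit }')

printf 'without awk: %s\n' "$without_awk"
printf 'with awk:    %s\n' "$with_awk"
printf 'awk only:    %s\n' "$awk_only"
```

The awk-only form hard-codes the pattern inside the awk program, so the two-step version may still be the simpler fit when the pattern comes from NODE_ADDRESS.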

I am preparing a pull request for this change.