MDEV-25855 Added support for Galera replication with cluster auto bootstrapping

tymonx commented 3 years ago

This patch adds support for Galera replication. It fixes #28 (Support Galera Replication).

Features:

it detects whether Galera replication is enabled, using the mysql configuration files or the provided mysqld command line arguments
by default it enables the cluster auto bootstrap feature
by default the first cluster node is used for cluster auto bootstrapping, based on the wsrep_cluster_address parameter from the mysql configuration files, the mysqld command line arguments, or the WSREP_CLUSTER_ADDRESS environment variable
the cluster auto bootstrap feature can be disabled by setting the WSREP_SKIP_AUTO_BOOTSTRAP environment variable
use the WSREP_AUTO_BOOTSTRAP_ADDRESS environment variable to explicitly choose another node for cluster bootstrapping
cluster node hostnames or IP addresses must be valid for cluster auto bootstrapping to work
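The selection rules in the feature list can be sketched roughly as follows. This is not the code from the patch: the helper names first_cluster_node and bootstrap_target are hypothetical, and the sketch assumes that "setting" an environment variable means giving it a non-empty value.

```shell
#!/usr/bin/env sh

# Hypothetical helper: print the first node of a wsrep_cluster_address value.
first_cluster_node() {
    nodes="${1#*://}"             # drop the URI scheme, e.g. gcomm://
    printf '%s\n' "${nodes%%,*}"  # keep everything before the first comma
}

# Hypothetical helper: print the node expected to bootstrap the cluster,
# or nothing when auto bootstrapping is disabled.
bootstrap_target() {
    if [ -n "${WSREP_SKIP_AUTO_BOOTSTRAP:-}" ]; then
        return 0   # assumption: any non-empty value disables the feature
    fi
    if [ -n "${WSREP_AUTO_BOOTSTRAP_ADDRESS:-}" ]; then
        printf '%s\n' "${WSREP_AUTO_BOOTSTRAP_ADDRESS}"   # explicit choice wins
    else
        first_cluster_node "$1"                           # default: first node
    fi
}
```

For example, bootstrap_target "gcomm://db_node_1,db_node_2,db_node_3" prints db_node_1 unless one of the two environment variables changes the outcome.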

How to use it.

Prepare mysql configuration file galera.cnf:

[galera]
wsrep_on                       = ON
wsrep_sst_method               = rsync
wsrep_provider                 = /usr/lib/libgalera_smm.so
bind-address                   = 0.0.0.0
binlog_format                  = row
default_storage_engine         = InnoDB
innodb_doublewrite             = 1
innodb_autoinc_lock_mode       = 2
innodb_flush_log_at_trx_commit = 2

Remove write permission for others (it fixes Warning: World-writable config file):

chmod o-w galera.cnf

Prepare Docker Compose file docker-compose.yml:

services:
    node:
        image: mariadb
        restart: always
        environment:
            WSREP_CLUSTER_ADDRESS: "${WSREP_CLUSTER_ADDRESS:-}"
            MYSQL_ROOT_PASSWORD: example
        volumes:
            - ./galera.cnf:/etc/mysql/conf.d/10-galera.cnf:ro,z
        command:
            - --wsrep-cluster-address=gcomm://db_node_1,db_node_2,db_node_3
        deploy:
            replicas: 3

Start Docker Compose:

docker-compose --project-name db up

To start N MariaDB instances using environment variable:

export WSREP_CLUSTER_ADDRESS="gcomm://db_node_1,db_node_2,db_node_3,db_node_4,db_node_5"
docker-compose --project-name db up --scale node="$(echo "${WSREP_CLUSTER_ADDRESS}" | tr ',' ' ' | wc -w)"

To start N MariaDB instances using mysql configuration file:

docker-compose --project-name db up --scale node="$(grep -i wsrep_cluster_address <name>.cnf | tr -d ' ' | tr ',' ' ' | wc -w)"

To start N MariaDB instances using POSIX script helper:

#!/usr/bin/env sh

# usage: scale.sh <project-name> <service-name> <scale>
#  e.g.: scale.sh db node 5

PROJECT_NAME="${1:-db}"
SERVICE_NAME="${2:-node}"
SCALE="${3:-3}"

# export the variable so that docker-compose can interpolate it in docker-compose.yml
export WSREP_CLUSTER_ADDRESS="gcomm://${PROJECT_NAME}_${SERVICE_NAME}_1"

for i in $(seq 2 "${SCALE}"); do
    WSREP_CLUSTER_ADDRESS="${WSREP_CLUSTER_ADDRESS},${PROJECT_NAME}_${SERVICE_NAME}_${i}"
done

docker-compose --project-name "${PROJECT_NAME}" up --scale "${SERVICE_NAME}"="${SCALE}"

Example usage:

./scale.sh db node 5

julienfritsch44 commented 3 years ago

@janlindstrom do you think you can review this, please?

janlindstrom commented 3 years ago

I must say I do not know much about docker but changes do look reasonable.

grooverdan commented 3 years ago

Thanks @janlindstrom.

@tymonx sorry I've been so slow, but I am progressing. I've been testing with podman{,-compose}; being userspace-only, it limits some things, like unique IP addresses per node (there will probably be a way eventually). I've also been reacquainting myself with Galera and Compose to ensure that it's the right design.

I'm pretty happy so far. Just been composing test cases.

Success:

detection of volume state and the initialization

Not Yet (to be fixed eventually):

ports on the cluster address should be ignored (very small change to docker_address_match).

What was the rationale behind the order in: docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1" ? Wouldn't you take direct $1 matches before a resolution?

ChristianCiach commented 3 years ago

Hi @tymonx! Thank you for doing this! We are currently evaluating bitnami/mariadb-galera, but we are seeing quite a lot of bugs. Some of these bugs happen because this image is not designed for host networking (--network host) or for using IP addresses instead of hostnames in the wsrep-cluster-address (even though this is recommended by the Galera documentation).

Please make sure that your PR also works in these cases.

Also, you may want to provide an option to force a container into bootstrap mode. When the whole cluster crashes, it may happen that no node is safe_to_bootstrap. When this happens, one node must be forced to bootstrap. On native mariadb installations, you would just run mysqld --wsrep-new-cluster again after editing grastate.dat to set safe_to_bootstrap: 1. The Bitnami image provides the environment variable MARIADB_GALERA_FORCE_SAFETOBOOTSTRAP (see https://github.com/bitnami/bitnami-docker-mariadb-galera/blob/3b93659e7d0647a5bf3810cc204d71d834120266/10.5/debian-10/rootfs/opt/bitnami/scripts/libmariadbgalera.sh#L99).

But after thinking about this for a minute, this is probably not necessary here, because the user could just pass --wsrep-new-cluster as a command to docker run, right? (This is not possible when using the Bitnami image, which is probably why they invented the environment variable).
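As an illustration of the manual grastate.dat edit on a native installation, here is a minimal sketch. The helper name force_safe_to_bootstrap is hypothetical, and the path in the usage comment assumes the default data directory.

```shell
#!/usr/bin/env sh

# Hypothetical helper: mark a node's saved Galera state as safe to bootstrap.
# Usage: force_safe_to_bootstrap /var/lib/mysql/grastate.dat
force_safe_to_bootstrap() {
    state_file="$1"
    [ -f "$state_file" ] || return 1
    # Rewrite through a temporary file instead of relying on sed -i,
    # whose syntax differs between GNU and BSD sed.
    sed 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "$state_file" > "${state_file}.tmp" \
        && mv "${state_file}.tmp" "$state_file"
}
```

This should only be run on the node with the most recent state, while mysqld is stopped; the node can then be started with --wsrep-new-cluster. Note that replacing the file through mv may change its ownership, so it may need to be restored for the mysql user.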

ChristianCiach commented 3 years ago

It would be nice if you could provide a way to force a node into bootstrap mode just once. In case of a cluster crash, I want a node to force-bootstrap just once to repair the cluster. But when I do docker restart when the cluster is working again, I don't want the container to force-bootstrap again.

Edit: I have no idea how this could be achieved...

grooverdan commented 3 years ago

@ChristianCiach thanks for your interest and for describing the requirements/use cases. The number of variants is what is taking this so long to review. While the aim is not to be comprehensive in the first version, I do aim to use an implementation that will be stable.

Yes, --wsrep-new-cluster can be passed as an argument as a force option but, as you mentioned, this isn't desired on restart, so a different option/variable is needed.

I'm going to consider this bootstrap first, and then recovery as the next step.

ChristianCiach commented 3 years ago

Bitnami's MARIADB_GALERA_FORCE_SAFETOBOOTSTRAP has the same issue, as it also doesn't remove itself. When using this environment variable, you have to remember to re-deploy the container without this variable after the cluster has recovered.
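One conceivable way to get a force option that "removes itself" is a marker file in the data volume that is consumed on first use. This is purely a sketch of the idea, not something the PR or the Bitnami image implements; the file name force-bootstrap-once and the helper name are hypothetical.

```shell
#!/usr/bin/env sh

# Hypothetical one-shot marker: the operator creates this file in the data
# directory; the entrypoint consumes it so a forced bootstrap happens once.
FORCE_FLAG_NAME="force-bootstrap-once"

# Succeeds (and deletes the marker) only when the marker file exists.
consume_force_bootstrap_flag() {
    flag="$1/${FORCE_FLAG_NAME}"
    [ -f "$flag" ] || return 1
    rm -f "$flag"
}
```

An entrypoint could then do something like: if consume_force_bootstrap_flag /var/lib/mysql; then set -- "$@" --wsrep-new-cluster; fi. Because the marker lives in the volume and is deleted when used, a later docker restart would not force a bootstrap again.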

tymonx commented 3 years ago

I'm back :)

ports on the cluster address should be ignored (very small change to docker_address_match).

Fixed. I have also added a line for stripping cluster address options ?option1=value1[&option2=value2]:

# it removes URI schemes like gcomm://
address="${address#[[:graph:]]*://}"

# it removes options suffix ?option1=value1[&option2=value2]
# (done before the port so that a port inside the options is never matched)
address="${address%\?[[:graph:]]*}"

# it removes port suffix per address
address="${address%:[0-9]*}"
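To show what the three stripping steps are meant to produce for a single address, here is a small self-contained sketch. The function name normalize_address is hypothetical, and the sketch assumes hostnames or IPv4 addresses (a bare IPv6 address would be mangled by the port step).

```shell
#!/usr/bin/env sh

# Hypothetical helper: reduce one cluster address to a bare host or IP.
normalize_address() {
    address="$1"
    address="${address#*://}"      # drop the URI scheme, e.g. gcomm://
    address="${address%%\?*}"      # drop ?option1=value1[&option2=value2]
    address="${address%:[0-9]*}"   # drop a trailing :port
    printf '%s\n' "$address"
}
```

With these rules, gcomm://db_node_1:4567?gmcast.segment=1 is reduced to db_node_1, and an address that has no scheme, port, or options passes through unchanged.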

What was the rationale behind the order in: docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1" ? Wouldn't you take direct $1 matches before a resolution?

I was just randomly hitting my keyboard. No specific reason. I have already changed the order: hostnames are matched first.

I have added new changes after some intense testing in various environments: Docker Compose, Docker Swarm, QEMU, Fedora CoreOS, with and without virtualization, and on physical machines.

DNS lookups in both directions, IP -> hostname and hostname -> IP. This allows the node's IP address or hostname to be matched correctly.

Reasons:

Docker Compose/Swarm implicitly creates two hostnames: <service-name>-<id>.<network-name> and a random hash. This allows matching against <service-name>-<id>.<network-name> or <service-name>-<id>.
Virtual machines like QEMU hide the guest (the container with MariaDB) in its own network with its own IP. It is possible to pass the hostname from Compose/Swarm like this: -netdev user,id=<name>,hostname=$(hostname) -device virtio-net,netdev=<name>, and then use <service-name>-<id>.<network-name> or <service-name>-<id>.
The machine hostname can be any name, including one that is not reachable from the network. A DNS reverse lookup resolves that.
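The short-name versus fully qualified matching described in the first reason can be illustrated with plain string handling. This sketch deliberately leaves out the DNS lookups themselves; the helper names is_ipv4 and hostname_matches are hypothetical and are not the functions from the patch.

```shell
#!/usr/bin/env sh

# Hypothetical helper: succeed when the value looks like a dotted IPv4 address.
is_ipv4() {
    case "$1" in
        ''|*[!0-9.]*) return 1 ;;
        *)            return 0 ;;
    esac
}

# Hypothetical helper: succeed when two names refer to the same host, treating
# <service-name>-<id> and <service-name>-<id>.<network-name> as equal.
hostname_matches() {
    candidate="$1"
    own="$2"
    [ "$candidate" = "$own" ] && return 0
    # IP addresses must match exactly; never compare them by first label.
    if is_ipv4 "$candidate" || is_ipv4 "$own"; then
        return 1
    fi
    [ "${candidate%%.*}" = "${own%%.*}" ]
}
```

For instance, hostname_matches node-1 node-1.db_default succeeds, while hostname_matches 10.0.0.1 10.0.0.2 fails.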

I have fixed the YAML example in the PR description. The proper SELinux label is :ro,z, not :ro,Z (see "Configure the selinux label" in the Docker documentation).

To Do:

Checking the $wsrepdir/gvwstate.dat file is not enough. On a graceful container shutdown this file is removed by the MariaDB daemon, which causes bootstrapping to run again on the next start. I'm currently looking into how to improve this.

ChristianCiach commented 3 years ago

To be honest, I don't fully trust your ip/hostname detection logic. There are too many "but what if"s. For example, what happens if the machine has multiple network devices and the container is deployed using "host networking"? Also, I've seen many environments where dns reverse lookup is just not possible.

I would like to be able to explicitly define the node address of the current container. For example, if wsrep_cluster_address is gcomm://172.28.180.96,172.28.180.97,172.28.180.98, I would like to be able to explicitly define the node address of the second node to 172.28.180.97. If you already know the node address of the current node, there is no need to guess anymore. In fact, I already do pass the node address to the container using --wsrep_node_address.

tymonx commented 3 years ago

@ChristianCiach no problem, I can add a comparison with the wsrep-node-address value.

It depends on user needs. For example, wsrep-node-address is useless when someone is using replicas or global mode, because it would have to be set somehow for each created container.

ChristianCiach commented 3 years ago

Yes, of course, I agree with you :) It is not always possible to have different configurations for each node. For example, if you want to scale your cluster up/down dynamically (for example using Docker Swarm services or Kubernetes StatefulSet), then it is very hard or even impossible to set wsrep-node-address.

I think it would be awesome if you could at least look at wsrep-node-address if it is set, just like you said! Also, please support both cases, where wsrep-node-address is defined inside a .cnf file or passed as a command by using --wsrep-node-address.

Again, thank you so much for doing this. It already looks very promising!

tymonx commented 3 years ago

I think it would be awesome if you could at least look at wsrep-node-address if it is set, just like you said! Also, please support both cases, where wsrep-node-address is defined inside a .cnf file or passed as a command by using --wsrep-node-address

Sure. It is very reasonable to do that. I was thinking about the same.

tymonx commented 3 years ago

@ChristianCiach I have already added support for the --wsrep-node-address.

When the wsrep-node-address is provided through configuration files or the command line, the automatic Docker address matching mechanism for selecting the bootstrap node is skipped. By default the value is compared to the first value from the wsrep-cluster-address. To choose another node, use the WSREP_AUTO_BOOTSTRAP_ADDRESS environment variable.

grooverdan commented 3 years ago

Just to share some rough stuff I've been looking at (that covers other galera options) and needing to reread the above:

diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh
index 1b10dc2..e51dc02 100755
--- a/docker-entrypoint.sh
+++ b/docker-entrypoint.sh
@@ -359,7 +359,25 @@ docker_ip_match() {
 #    ie: docker_address_match node1
 # it returns true if provided value match with container IP address or container hostname. Otherwise it returns false
 docker_address_match() {
-       local resolved="$(resolveip --silent "$1" 2>/dev/null)" # it converts hostname to ip or vice versa
+       local host=${1%%:*}
+       local port=${1#*:}
+       if [ -n "$port" ]; then
+               local wsrep_provider_options="$(mysql_get_config wsrep_provider_options)"
+               wsrep_provider_options=( ${wsrep_provider_options//,/ } )
+               for opt in "${wsrep_provider_options[@]}"; do
+                       if [[ "$opt" =~ gmcast.listen_addr.* ]]; then
+                               local val="${opt#*=[[:graph:]]*://}"
+                               case "$val" in
+                                       ${host}:${port})        return 1 ;;
+                                       0.0.0.0:${port})        break ;;
+                                       *:${port})              break ;;
+                                       *)                      return 0;;
+                               esac
+                       fi
+               done
+
+       fi
+       local resolved="$(resolveip --silent "$host" 2>/dev/null)" # it converts hostname to ip or vice versa

        docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1"
 }

As a crude hack with:

#!/bin/bash
podman pod stop db && podman pod rm db
podman pod create --name=db  --share net
for n in 1 2 3
do
    podman create --name=db_node_$n --pod=db \
            --security-opt label=disable --label io.podman.compose.config-hash=123 --label io.podman.compose.project=db --label io.podman.compose.version=0.0.1 --label com.docker.compose.container-number=$n --label com.docker.compose.service=node \
        -e MARIADB_ROOT_PASSWORD=example \
        --add-host node:127.0.0.1 --add-host db_node_1:127.0.0.1 --add-host db_node_2:127.0.0.1 --add-host db_node_3:127.0.0.1 \
        --restart always \
        mariadb:testgalera --port $(( 3306 - 1 + $n )) --wsrep_cluster_address=gcomm://db_node_1:4567,db_node_2:4577,db_node_3:4587 --wsrep-node-address=127.0.0.1 --wsrep_provider_options="gmcast.listen_addr=tcp://0.0.0.0:$(( 4567 + ( $n - 1 ) * 10 ))" --wsrep-on=1 --wsrep-provider=/usr/lib/libgalera_smm.so --binlog_format=ROW
done

Is there a point at which the autobootstrap is (always?) applied if you are actually starting from an empty datadir? Anything else is recovery.

Should non-first nodes not initialize with /docker-entrypoint-initdb.d/ (and rely on galera sst)?

tymonx commented 3 years ago

Is there a point at which the autobootstrap is (always?) applied if you are actually starting from an empty datadir?

The Docker daemon (I don't know about Podman) always creates a volume for the container. If the container stops and starts again (including a restart), the files are still present, so bootstrapping will not fire.

I have also tested and confirmed that on a graceful shutdown (docker kill --signal SIGTERM <container>) the mysqld daemon removes the gvwstate.dat file.

I'm looking into a more proper solution to handle this.

tymonx commented 3 years ago

For Podman I cannot simply strip the port numbers from wsrep-cluster-address. They must also be included in the comparison, because Podman containers in a pod all share 127.0.0.1, whereas Docker always creates each container with its own IP address.
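A port-aware comparison along those lines might look like the following sketch. The helper name address_matches is hypothetical, and the rule that ports are compared only when both sides specify one is an assumption made for illustration, not necessarily what the patch does. Only hostnames and IPv4 addresses are considered.

```shell
#!/usr/bin/env sh

# Hypothetical helper: succeed when this node's address matches a candidate
# from wsrep-cluster-address. Hosts must be equal; ports are compared only
# when both addresses carry one.
address_matches() {
    node="$1"
    candidate="$2"
    case "$node" in
        *:*) node_port="${node##*:}" ;;
        *)   node_port="" ;;
    esac
    case "$candidate" in
        *:*) candidate_port="${candidate##*:}" ;;
        *)   candidate_port="" ;;
    esac
    [ "${node%%:*}" = "${candidate%%:*}" ] || return 1
    if [ -n "$node_port" ] && [ -n "$candidate_port" ]; then
        [ "$node_port" = "$candidate_port" ]
    fi
}
```

With every Podman container on 127.0.0.1, address_matches 127.0.0.1:4577 127.0.0.1:4567 fails because the ports differ, while on Docker a port-less node address such as db_node_1 still matches db_node_1:4567.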

tymonx commented 3 years ago

Working Podman example script to start N containers in db pod for commit 45149e22dad93d10d671bdd4cc727c405d43817e:

#!/usr/bin/env sh

NODES="${1:-3}"

options="--add-host db_node_1:127.0.0.1"
address="db_node_1:4567"

for i in $(seq 2 "${NODES}"); do
    options="${options} --add-host db_node_$i:127.0.0.1"
    address="${address},db_node_$i:$(( 4567 + ( $i - 1 ) * 10 ))"
done

podman pod stop db
podman pod rm db
podman pod create --name=db --share net

for i in $(seq 1 "${NODES}"); do
    podman create \
        --pod=db \
        --name=db_node_$i \
        --security-opt label=disable \
        --env MARIADB_ROOT_PASSWORD=example \
        --restart always \
        ${options:+${options}} \
        mariadb:dev \
        --port $(( 3305 + $i )) \
        --wsrep_cluster_address="gcomm://${address}" \
        --wsrep-node-address="db_node_$i:$(( 4567 + ( $i - 1 ) * 10 ))" \
        --wsrep-on=on \
        --wsrep-provider=/usr/lib/libgalera_smm.so \
        --binlog_format=row
done

podman pod start db

View logs:

podman logs --follow db_node_1

Output:

View:
  id: b98b33bc-d845-11eb-99df-0245217d5d15:2
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 2
  members(3):
        0: b988176f-d845-11eb-994b-576602aed1c3, c38f5df9273f
        1: b98a6f95-d845-11eb-a223-e6bc7724aaf5, 7268f0eb1373
        2: b98aa0de-d845-11eb-945b-e21ff9f0f09e, 81f7c1cfefbe

tymonx commented 3 years ago

Added support for safe_to_bootstrap from the grastate.dat file. This works when all nodes are shut down gracefully, but only one at a time. Galera writes safe_to_bootstrap: 1 on the last node to be gracefully shut down.
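Reading that flag can be sketched as below. This is an illustration rather than the code from the commit; the helper name is_safe_to_bootstrap is hypothetical.

```shell
#!/usr/bin/env sh

# Hypothetical helper: succeed only when grastate.dat exists and records
# safe_to_bootstrap: 1.
# Usage: is_safe_to_bootstrap /var/lib/mysql/grastate.dat
is_safe_to_bootstrap() {
    [ -f "$1" ] || return 1
    # Print only the value that follows the safe_to_bootstrap key.
    value="$(sed -n 's/^safe_to_bootstrap:[[:space:]]*//p' "$1")"
    [ "$value" = "1" ]
}
```

A missing file is treated as "not safe" here, which leaves the decision for a brand-new, empty data directory to other logic.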

Docker Compose users should, after docker-compose up, manually call docker stop db_node_<n> for each node in turn. Invoking the docker-compose stop command or pressing CTRL + C will gracefully shut down all nodes at the same time, and Galera cannot handle this properly.
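The one-node-at-a-time shutdown can be scripted. In this sketch the function name stop_nodes and the DOCKER override variable are hypothetical; the db_node_<n> container names follow the examples above.

```shell
#!/usr/bin/env sh

# Hypothetical helper: stop db_node_<count> down to db_node_1, one at a time,
# so that db_node_1 is the last node to shut down gracefully.
# Set DOCKER=echo for a dry run that only prints the commands.
stop_nodes() {
    i="$1"
    while [ "$i" -ge 1 ]; do
        "${DOCKER:-docker}" stop "db_node_${i}" || return 1
        i=$((i - 1))
    done
}
```

Stopping the highest-numbered node first is a design choice: it leaves db_node_1, the default bootstrap node, as the one holding safe_to_bootstrap: 1. For example, stop_nodes 3 stops db_node_3, then db_node_2, then db_node_1.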

grooverdan commented 2 years ago

I've rebased and squashed the commits. ShellCheck changed a few things. As a basic bootstrap it's OK. I'm still looking at what crash recovery would look like. We probably need to make our own state transition diagram.

https://galeracluster.com/library/documentation/crash-recovery.html

grooverdan commented 2 years ago

@ChristianCiach et al. I welcome any summary of the test cases needed, on MDEV-25855 (preferred) or here. I have looked through the bitnami galera issue referenced above, and the blog, from which I'll derive some cases too.

jozefrebjak commented 2 years ago

Hello, any news on this PR?

MariaDB / mariadb-docker

MDEV-25855 Added support for Galera replication with cluster auto bootstrapping #377