Open tymonx opened 3 years ago
@janlindstrom do you think you can review this, please?
I must say I do not know much about docker but changes do look reasonable.
Thanks @janlindstrom.
@tymonx sorry I've been so slow; I am making progress. I've been testing with podman{,-compose}. Being userspace-only limits some things, like unique IP addresses per node (there will probably be a way eventually), and I've been reacquainting myself with Galera and Compose to ensure that it's the right design.
I'm pretty happy so far. Just been composing test cases.
Success:
Not Yet (to be fixed eventually):
What was the rationale behind the order in: docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1"? Wouldn't you take direct $1 matches before a resolution?
Hi @tymonx! Thank you for doing this! We are currently evaluating bitnami/mariadb-galera, but we are seeing quite a lot of bugs. Some of these bugs happen because this image is not designed for host networking (--network host) or for using IP addresses instead of hostnames in the wsrep-cluster-address (even though this is recommended by the Galera documentation).
Please make sure that your PR also works in these cases.
Also, you may want to provide an option to force a container into bootstrap mode. When the whole cluster crashes, it may happen that no node is safe_to_bootstrap. When this happens, one node must be forced to bootstrap. On native MariaDB installations, you would just run mysqld --wsrep-new-cluster again after editing grastate.dat to set safe_to_bootstrap: 1. The Bitnami image provides the environment variable MARIADB_GALERA_FORCE_SAFETOBOOTSTRAP (see https://github.com/bitnami/bitnami-docker-mariadb-galera/blob/3b93659e7d0647a5bf3810cc204d71d834120266/10.5/debian-10/rootfs/opt/bitnami/scripts/libmariadbgalera.sh#L99).
But after thinking about this for a minute, this is probably not necessary here, because the user could just pass --wsrep-new-cluster as a command to docker run, right? (This is not possible when using the Bitnami image, which is probably why they invented the environment variable.)
It would be nice if you could provide a way to force a node into bootstrap mode just once. In case of a cluster crash, I want a node to force-bootstrap just once to repair the cluster. But when I do docker restart once the cluster is working again, I don't want the container to force-bootstrap again.
Edit: I have no idea how this could be achieved...
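For reference, the manual grastate.dat edit mentioned above (the native-installation step before running mysqld --wsrep-new-cluster) could be sketched roughly as follows. The function name and the default path /var/lib/mysql/grastate.dat are assumptions for illustration only; nothing here is taken from the PR or from the Bitnami image.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): mark a node as safe to bootstrap
# by rewriting the safe_to_bootstrap line in its grastate.dat.
# The default path is an assumption about where the data directory lives.
force_safe_to_bootstrap() {
    local grastate="${1:-/var/lib/mysql/grastate.dat}"

    # Nothing to edit if the node has no saved Galera state yet.
    [ -f "$grastate" ] || return 1

    # Rewrite only the safe_to_bootstrap line; every other line is kept as-is.
    sed 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "$grastate" > "$grastate.tmp" || return 1

    # Copy the content back so the original file keeps its owner and mode.
    cat "$grastate.tmp" > "$grastate" && rm -f "$grastate.tmp"
}
```

It would be run against a stopped node, for example force_safe_to_bootstrap /var/lib/mysql/grastate.dat, before starting that node with --wsrep-new-cluster.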
@ChristianCiach thanks for your interest and for describing the requirements/use cases. The number of variants is what is making this review take so long. While the aim is not to be comprehensive in the first version of the functionality, I do aim for an implementation that will be stable.
Yes, --wsrep-new-cluster can be passed as an argument as a force option, but as you mentioned, this isn't desired on restart, so a different option/variable is needed.
I'm going to consider this bootstrap first, and then recovery as the next step.
Bitnami's MARIADB_GALERA_FORCE_SAFETOBOOTSTRAP has the same issue, as it also doesn't remove itself. When using this environment variable, you have to remember to re-deploy the container without this variable after the cluster has recovered.
I'm back :)
- ports on the cluster address should be ignored (very small change to docker_address_match).
Fixed. I have also added a line for stripping cluster address options ?option1=value1[&option2=value2]:
# it removes URI schemes like gcomm://
address="${address#[[:graph:]]*://}"
# it removes port suffix per address
address="${address/:[0-9]*/}"
# it removes options suffix ?option1=value1[&option2=value2]
address="${address%\?[[:graph:]]*}"
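As a rough illustration of the same idea applied to a whole wsrep-cluster-address value, the sketch below strips the scheme and the trailing options once, then drops the port from each comma-separated entry. The function name cluster_hosts is hypothetical, and the sketch assumes hostnames or IPv4 addresses only (an IPv6 literal would be cut at its first colon).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not the PR's code): print one bare host per line
# from a value such as gcomm://host1:4567,host2:4577?option1=value1
cluster_hosts() {
    local list="$1" entry

    list="${list#*://}"     # drop a URI scheme such as gcomm://
    list="${list%%\?*}"     # drop trailing options ?option1=value1[&option2=value2]

    local IFS=,             # split the remaining list on commas
    for entry in $list; do
        printf '%s\n' "${entry%%:*}"    # drop the :port suffix of this entry
    done
}
```

For example, cluster_hosts 'gcomm://db_node_1:4567,db_node_2:4577' prints db_node_1 and db_node_2 on separate lines.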
What was the rationale behind the order in: docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1"? Wouldn't you take direct $1 matches before a resolution?
I was just randomly hitting my keyboard; there was no specific reason. I have already changed the order: hostnames first.
I have added new changes after some intense testing on various environments, Docker Compose, Docker Swarm, QEMU, Fedora CoreOS, with/without virtualization or physical machines.
IP -> hostname and hostname -> IP. This allows the node's IP address or hostname to be matched correctly.
Reasons:
<service-name>-<id>.<network-name> and random hash. This allows matching with <service-name>-<id>.<network-name> or <service-name>-<id>
-netdev user,id=<name>,hostname=$(hostname) -device virtio-net,netdev=<name> and use <service-name>-<id>.<network-name> or <service-name>-<id>
hostname can have any name that is not reachable from the network. DNS reverse lookup resolves that
:ro,z not :ro,Z
Configure the selinux label
To Do:
$wsrepdir/gvwstate.dat file is not enough. On graceful container shutdown this file is removed by the MariaDB daemon, which causes bootstrapping to run again. I'm currently looking into improving this.
To be honest, I don't fully trust your IP/hostname detection logic. There are too many "but what if"s. For example, what happens if the machine has multiple network devices and the container is deployed using "host networking"? Also, I've seen many environments where DNS reverse lookup is just not possible.
I would like to be able to explicitly define the node address of the current container. For example, if wsrep_cluster_address is gcomm://172.28.180.96,172.28.180.97,172.28.180.98, I would like to be able to explicitly define the node address of the second node to 172.28.180.97. If you already know the node address of the current node, there is no need to guess anymore. In fact, I already do pass the node address to the container using --wsrep_node_address.
@ChristianCiach no problem, I can add a comparison with the wsrep-node-address value.
It depends on user needs. For example, wsrep-node-address is useless when someone is using replicas or global mode, because it would have to be set somehow for each created container.
Yes, of course, I agree with you :) It is not always possible to have different configurations for each node. For example, if you want to scale your cluster up/down dynamically (for example using Docker Swarm services or Kubernetes StatefulSet), then it is very hard or even impossible to set wsrep-node-address.
I think it would be awesome if you could at least look at wsrep-node-address if it is set, just like you said! Also, please support both cases, where wsrep-node-address is defined inside a .cnf file or passed as a command by using --wsrep-node-address.
Again, thank you so much for doing this. It already looks very promising!.
I think it would be awesome if you could at least look at wsrep-node-address if it is set, just like you said! Also, please support both cases, where wsrep-node-address is defined inside a .cnf file or passed as a command by using --wsrep-node-address
Sure. It is very reasonable to do that. I was thinking about the same.
@ChristianCiach I have already added support for --wsrep-node-address.
When someone provides wsrep-node-address via configuration files or the command line, the automatic Docker address match mechanism is skipped when selecting the node for bootstrapping. By default the value is compared to the first value from wsrep-cluster-address. To choose another node, use the WSREP_AUTO_BOOTSTRAP_ADDRESS environment variable.
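The selection rule described above could look roughly like the sketch below. Only WSREP_AUTO_BOOTSTRAP_ADDRESS is a name from this PR; the function should_bootstrap and the exact string comparison (ports included, no name resolution) are assumptions made for illustration, not the entrypoint's actual logic.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: succeed (return 0) when this node is the one that
# should bootstrap the cluster.
#   $1 - this node's wsrep-node-address
#   $2 - the full wsrep-cluster-address value
should_bootstrap() {
    local node_address="$1" cluster_address="$2"
    local target="${WSREP_AUTO_BOOTSTRAP_ADDRESS:-}"

    if [ -z "$target" ]; then
        # Default: the first entry of the cluster address list.
        target="${cluster_address#*://}"    # drop gcomm://
        target="${target%%\?*}"             # drop ?options
        target="${target%%,*}"              # keep only the first entry
    fi

    [ "$node_address" = "$target" ]
}
```

With gcomm://172.28.180.96,172.28.180.97,172.28.180.98, only the node whose address is 172.28.180.96 passes the check unless WSREP_AUTO_BOOTSTRAP_ADDRESS names a different node.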
Just to share some rough stuff I've been looking at (it covers other Galera options); I still need to reread the above:
diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh
index 1b10dc2..e51dc02 100755
--- a/docker-entrypoint.sh
+++ b/docker-entrypoint.sh
@@ -359,7 +359,25 @@ docker_ip_match() {
 # ie: docker_address_match node1
 # it returns true if provided value match with container IP address or container hostname. Otherwise it returns false
 docker_address_match() {
-    local resolved="$(resolveip --silent "$1" 2>/dev/null)" # it converts hostname to ip or vice versa
+    local host=${1%%:*}
+    local port=${1#*:}
+    if [ -n "$port" ]; then
+        local wsrep_provider_options="$(mysql_get_config wsrep_provider_options)"
+        wsrep_provider_options=( ${wsrep_provider_options//,/ } )
+        for opt in "${wsrep_provider_options[@]}"; do
+            if [[ "$opt" =~ gmcast.listen_addr.* ]]; then
+                local val="${opt#*=[[:graph:]]*://}"
+                case "$val" in
+                    ${host}:${port}) return 1 ;;
+                    0.0.0.0:${port}) break ;;
+                    *:${port}) break ;;
+                    *) return 0;;
+                esac
+            fi
+        done
+
+    fi
+    local resolved="$(resolveip --silent "$host" 2>/dev/null)" # it converts hostname to ip or vice versa
     docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1"
 }
As a crude hack with:
#!/bin/bash
podman pod stop db && podman pod rm db
podman pod create --name=db --share net
for n in 1 2 3
do
    podman create --name=db_node_$n --pod=db \
        --security-opt label=disable --label io.podman.compose.config-hash=123 --label io.podman.compose.project=db --label io.podman.compose.version=0.0.1 --label com.docker.compose.container-number=$n --label com.docker.compose.service=node \
        -e MARIADB_ROOT_PASSWORD=example \
        --add-host node:127.0.0.1 --add-host db_node_1:127.0.0.1 --add-host db_node_2:127.0.0.1 --add-host db_node_3:127.0.0.1 \
        --restart always \
        mariadb:testgalera --port $(( 3306 - 1 + $n )) --wsrep_cluster_address=gcomm://db_node_1:4567,db_node_2:4577,db_node_3:4587 --wsrep-node-address=127.0.0.1 --wsrep_provider_options="gmcast.listen_addr=tcp://0.0.0.0:$(( 4567 + ( $n - 1 ) * 10 ))" --wsrep-on=1 --wsrep-provider=/usr/lib/libgalera_smm.so --binlog_format=ROW
done
Is there a point at which the autobootstrap is (always?) applied if you are actually starting from an empty datadir? Anything else is recovery.
Should non-first nodes not initialize with /docker-entrypoint-initdb.d/ (and rely on galera sst)?
Is there a point at which the autobootstrap is (always?) applied if you are actually starting from an empty datadir?
The Docker daemon (I don't know about Podman) always creates a volume for the container. If the container stops and starts again (including on restart), the files are still present, so bootstrapping will not fire.
I have also tested and confirmed that on graceful shutdown (docker kill --signal SIGTERM <container>) the mysqld daemon removes the gvwstate.dat file.
I'm looking into a more proper solution to handle this.
For Podman I cannot simply strip port numbers from wsrep-cluster-address. They should also be included in the comparison, because Podman works on 127.0.0.1, whereas Docker always creates a container with its own IP address.
Working Podman example script to start N containers in db pod for commit 45149e22dad93d10d671bdd4cc727c405d43817e:
#!/usr/bin/env sh

NODES="${1:-3}"

# Every node name resolves to 127.0.0.1 because the pod shares one network namespace
options="--add-host db_node_1:127.0.0.1"
address="db_node_1:4567"

for i in $(seq 2 "${NODES}"); do
    options="${options} --add-host db_node_$i:127.0.0.1"
    address="${address},db_node_$i:$(( 4567 + ( $i - 1 ) * 10 ))"
done

podman pod stop db
podman pod rm db
podman pod create --name=db --share net

for i in $(seq 1 "${NODES}"); do
    podman create \
        --pod=db \
        --name=db_node_$i \
        --security-opt label=disable \
        --env MARIADB_ROOT_PASSWORD=example \
        --restart always \
        ${options:+${options}} \
        mariadb:dev \
        --port $(( 3305 + $i )) \
        --wsrep_cluster_address="gcomm://${address}" \
        --wsrep-node-address="db_node_$i:$(( 4567 + ( $i - 1 ) * 10 ))" \
        --wsrep-on=on \
        --wsrep-provider=/usr/lib/libgalera_smm.so \
        --binlog_format=row
done

podman pod start db
View logs:
podman logs --follow db_node_1
Output:
View:
id: b98b33bc-d845-11eb-99df-0245217d5d15:2
status: primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 2
members(3):
0: b988176f-d845-11eb-994b-576602aed1c3, c38f5df9273f
1: b98a6f95-d845-11eb-a223-e6bc7724aaf5, 7268f0eb1373
2: b98aa0de-d845-11eb-945b-e21ff9f0f09e, 81f7c1cfefbe
Added support for the safe_to_bootstrap value from the grastate.dat file. This works when all nodes are shut down gracefully, but only step by step: Galera writes 1 for the last node that was gracefully shut down.
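A minimal sketch of such a check, assuming the usual grastate.dat layout with a safe_to_bootstrap: <0|1> line. The function name and the default path are hypothetical and are not taken from the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if grastate.dat exists and marks this
# node as safe to bootstrap. The default path is an assumption.
is_safe_to_bootstrap() {
    local grastate="${1:-/var/lib/mysql/grastate.dat}"

    # A missing file means there is no saved state to trust.
    [ -f "$grastate" ] || return 1

    # Match the flag line exactly, tolerating extra whitespace around the value.
    grep -Eq '^safe_to_bootstrap:[[:space:]]*1[[:space:]]*$' "$grastate"
}
```

An entrypoint could then gate the bootstrap on it, for example: if is_safe_to_bootstrap "$DATADIR/grastate.dat"; then set -- "$@" --wsrep-new-cluster; fi (DATADIR being whatever variable holds the data directory).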
For Docker Compose users: after docker-compose up, they should manually call docker stop db_node_<n> for each node. Invoking the docker-compose stop command or hitting CTRL + C on the keyboard will gracefully shut down all nodes at the same time, and Galera cannot handle this properly.
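The step-by-step shutdown could be scripted along these lines. The function name and the DOCKER override variable are hypothetical conveniences (the override only makes the loop easy to dry-run); the container names follow the db_node_<n> pattern used above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: stop nodes one at a time, in the order given.
# docker stop returns only after the container has stopped, so each node
# finishes its graceful shutdown before the next one begins.
stop_nodes_in_order() {
    local docker_cmd="${DOCKER:-docker}" n

    for n in "$@"; do
        "$docker_cmd" stop "db_node_$n" || return 1
    done
}
```

For a three-node cluster, stop_nodes_in_order 3 2 1 leaves db_node_1 as the last node to shut down gracefully.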
I've rebased and squashed the commits. ShellCheck changed a few things. As a basic bootstrap it's OK. I'm still looking at what crash recovery would look like. We probably need to make our own state transition diagram.
https://galeracluster.com/library/documentation/crash-recovery.html
@ChristianCiach et al. I welcome any summary of the test cases needed, on MDEV-25855 (preferred) or here. I have looked through the Bitnami Galera issue referenced above, and the blog, from which I'll derive some cases too.
Hello, any news on this PR?
This patch adds support for Galera replication. It fixes #28 Support Galera Replication.
Features:
mysql configuration files or provided mysqld command line arguments
mysql configuration files, mysqld command line arguments or by setting the WSREP_CLUSTER_ADDRESS environment variable
WSREP_SKIP_AUTO_BOOTSTRAP environment variable
WSREP_AUTO_BOOTSTRAP_ADDRESS environment variable to explicitly choose another node for cluster bootstrapping
How to use it.
mysql configuration file galera.cnf:
Warning: World-writable config file):
docker-compose.yml:
To start N MariaDB instances using environment variable:
To start N MariaDB instances using mysql configuration file:
To start N MariaDB instances using POSIX script helper:
Example usage: