colinmollenhour / mariadb-galera-swarm

MariaDb Galera Cluster container based on official mariadb image which can auto-bootstrap and recover cluster state.
https://hub.docker.com/r/colinmollenhour/mariadb-galera-swarm
Apache License 2.0
217 stars 102 forks source link

MariaDB Galera Cluster with Network Volumes #41

Closed jaschaio closed 6 years ago

jaschaio commented 6 years ago

I was able to create a MariaDB Galera Cluster with the example stack file.

But I am concerned that it uses local docker volumes. What happens if a docker swarm node fails and I have to recreate it on a new host? Will the data be copied from the rest of the healthy MariaDB nodes? What happens performance wise if we are talking about very large databases? ( > 200GB and more)

To persist data even if a swarm node fails I am usually using rexray volumes to attach external block storage to my swarm nodes.

The problem is that I can't scale the MariaDB Cluster across different nodes, as I can only attach the block storage to a single node at a time.

So this doesn't work:

services:

    [… seed service]

    node:
        image: colinmollenhour/mariadb-galera-swarm:10.1
        [… networks, environment, secrets, commands etc.]
        volumes:
            - database:/var/lib/mysql
        deploy:
            replicas: 3

volumes:
    database:
        driver: rexray/dobs

It would only work if all three replicas would sit on the same docker swarm node which makes no sense as I would have no failover if the swarm node fails for some reason.

Aparently locking doesn't work neither when all mariadb nodes use the exact same volume. (I wasn't able to start the swarm with a single volume).

Another approach could be to create three different node services which each use a different rexray volume. But than I don't have the automatic load balancing anymore which docker swarm mode provides when multiple replicas are run of the same services. Is there something I could put in front of MariaDB to load balance the requests?

Or would a shared file system like GlusterFS make sense?

To be honest this is more a best practices question than a bug. I just dont know where to ask for this but feel that this is a very important question that is not well documentated anywhere.

jaschaio commented 6 years ago

This might be a solution to the problem:

Its possible to give a unique name to each rexray volume for each replica to use:

services:

    [… seed service]

    node:
        image: colinmollenhour/mariadb-galera-swarm:10.1
        [… networks, environment, secrets, commands etc.]
        volumes:
            - database:/var/lib/mysql
        deploy:
            replicas: 3

volumes:
    database:
        name: '{{.Service.Name}}-{{.Task.Slot}}'
        driver: rexray/dobs

After deploying the stack and creating the galera cluster you will get the following if you output docker volume ls:

rexray/dobs:latest   mysql-node-1
rexray/dobs:latest   mysql-node-2
rexray/dobs:latest   mysql-node-3
rexray/dobs:latest   mysql-seed-1

(The mysql- prefix depends on your stack name)

Not sure if I am overlooking something but this seems solid.

colinmollenhour commented 6 years ago

Everyone has different needs and preferences, but considering that Galera handles SST automatically I think local volumes are actually quite good in most cases. If you have three nodes running and want to upgrade all three nodes to run on more powerful machines, just take down and replace one node at a time and Galera will populate the new local volumes for you. Looks like you found another solution though, thanks for providing it! Feel free to submit a PR with your setup in a new directory in /examples if you think others might benefit.

I'm curious, how is the performance with your setup compared to local volumes?

jaschaio commented 6 years ago

Thanks for your answer and great question! I have seen ocasionally some errors when doing the following queries and than trying to connect immediately to the new database with the new user and password:

CREATE DATABASE some_db;
CREATE USER some_user@'%' IDENTIFIED BY some_password;
GRANT ALL PRIVILEGES ON some_db.* TO some_user@'%';
FLUSH PRIVILEGES;

The error is simply Access denied for user 'some_user'@'%' to database 'some_db'.

It happens in less than 10% of the cases and seems to has to do with rapidly doing a lot of queries (importing a small mysql dump) right after flushing privileges. If I only create the database, user and password without importing a dump afterwards I don't see any of those errors.

I haven't had time yet to investigate, but now that you mention it the reason might be a performance issue of the file system. The other reason might be that changes haven't yet propagated to another node of the database as all requests are load balanced by the overlay network. (Not sure if that can even happen when using a galera cluster, I thought all nodes have always the same state when they are up).

Do you know a good and simple way to test performance so than I can compare blockstorage vs local volumes?

On the same note: How performant is SST if you have very large amounts of data?

jaschaio commented 6 years ago

I actually just tested this with the script I was using anyway. So this is not scientific but actually still interesting.

Creating Database, User, Password, Flushing privileges and Importing a 44KB mysql dump with attached block storage:

Execution time : 4.2606270313263 seconds
Execution time : 4.2530460357666 seconds
Execution time : 4.1727960109711 seconds
Execution time : 4.3930230140686 seconds

Doing the exact same thing with local volumes:

Execution time : 0.58489298820496 seconds
Execution time : 0.44117593765259 seconds
Execution time : 0.5139148235321 seconds
Execution time : 0.013115167617798 seconds
Execution time : 0.4229941368103 seconds
Execution time : 0.56238698959351 seconds
Execution time : 0.012302875518799 seconds
Execution time : 0.42536997795105 seconds
Execution time : 0.60089898109436 seconds
Execution time : 0.01095986366272 seconds
Execution time : 0.45150113105774 seconds
Execution time : 0.58065295219421 seconds

This is on DigitalOcean btw.

So yes, you are right. Local volumes are about 10 times faster. So even if the SST would take some time if the database is really huge it probably still worth using local volumes because of the performance gains.

I am still seeing ocassionally the Access denied for user 'some_user'@'%' to database 'some_db' errors. Any theory what that might be? Am I right that it could take some times until privileges propagate? The only way to prevent them is to wait for at least a second between flushing privileges and trying to connect with the new user.

colinmollenhour commented 6 years ago

Strange, I've never experienced the permission issue.. Perhaps FLUSH PRIVILEGES doesn't replicate?

Thanks for sharing your performance findings, that is indeed a huge difference!

SST is very fast as it uses a direct unencrypted socket (via socat). You can use compression although it may or may not speed anything up. I think with a 60GB mysql data directory it takes about 5 minutes with pigz compression IIRC. It might actually be faster without compression but as I don't plan to move nodes very often I'm happy with 5 minutes.

Some of my tuning related to SST

[mysqld]
wsrep_provider_options="gcache.size=2048M; gcache.keep_pages_size=1024M; gcache.recover=yes;"
wsrep_slave_threads=4

[sst]
inno-apply-opts="--use-memory=20G"
compressor="pigz --fast --processes 4"
decompressor="pigz --decompress"
jaschaio commented 6 years ago

Thanks for posting that!

It looks like FLUSH PRIVILEGES is not replicated but as well not needed in this case as I am executing a GRANT.

But apparently all things that happen on the mysql. tables are treated a bit different in a Galera Cluster, so incorporating a very small delay of ~ 200 - 500 ms seems to prevent the error from happening.