colinmollenhour / mariadb-galera-swarm

MariaDb Galera Cluster container based on official mariadb image which can auto-bootstrap and recover cluster state.
https://hub.docker.com/r/colinmollenhour/mariadb-galera-swarm
Apache License 2.0

large database problem #5

Closed · bartoszx closed this issue 7 years ago

bartoszx commented 7 years ago

Hi

I have been doing a lot of tests over the last two weeks. Everything works like a charm with a small database of at most 800MB. The problems start with a large database of several GB or more.

My first approach was to do this the way you suggested in https://github.com/colinmollenhour/mariadb-galera-swarm/issues/1. This is very interesting, but the problem is when your services crash: when they crash, Docker removes the volumes and starts creating new containers with a blank DB, and all logs and data disappear.

My second approach was to create mysql-seed with no dedicated volume and the nodes in global mode with a dedicated volume. I created the cluster and imported 3GB of data. Then I added a new server to the cluster; Docker created a new container and started transferring data. After X seconds the healthcheck kills the container because there is no mysql process running:

ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111 "Connection refused")

and this scenario repeats forever.

This behavior sometimes causes problems with other nodes (I don't know why) and the other nodes start restarting too. If you don't have a dedicated volume, you lose your data.

colinmollenhour commented 7 years ago

Why does your service crash? You should definitely tune the MySQL config so it doesn't allocate more memory than is available, and, to prevent a major Docker malfunction, also set memory limits with Docker so that the OOM killer doesn't kill the Docker daemon. I've used this same image on Kontena with a 4GB compressed dump (over 30GB of data directory once imported into MySQL) and not had this issue, but my machines had 64GB of RAM.

A persistent volume is definitely a must: in the case of a crash, database upgrade, reboot, or other event, you want the server to be able to rejoin the cluster with a small incremental state transfer rather than a full snapshot each time; not to mention that if all nodes go down you don't want to lose your entire dataset.
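
As a rough sketch of what that might look like for a swarm service (the memory sizes and the volume name are placeholders, and the required Galera arguments/environment variables are omitted):

docker service create \
  --name galera-node \
  --limit-memory 4g \
  --reserve-memory 2g \
  --mount type=volume,source=mysql-data,target=/var/lib/mysql \
  colinmollenhour/mariadb-galera-swarm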

colinmollenhour commented 7 years ago

Check out these options from the docker service create docs for a probable solution to your issue with docker restarting containers:

  --restart-condition string       Restart when condition is met (none, on-failure, or any)
  --restart-delay value            Delay between restart attempts (default none)
  --restart-max-attempts value     Maximum number of restarts before giving up (default none)
  --restart-window value           Window used to evaluate the restart policy (default none)
  --stop-grace-period value        Time to wait before force killing a container (default none)

There are other options that pertain to memory limits and mounted volumes. I didn't include every option in the example above because they will vary per user.
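
For example, a sketch combining a few of them (the values are illustrative, not recommendations, and the other required options are omitted):

docker service create \
  --name galera-node \
  --restart-condition on-failure \
  --restart-delay 10s \
  --restart-max-attempts 5 \
  --stop-grace-period 60s \
  colinmollenhour/mariadb-galera-swarm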

bartoszx commented 7 years ago

The problem is with the HEALTHCHECK command:

HEALTHCHECK CMD curl -f -o - http://localhost:8080/ || exit 1

The default values, --interval=DURATION (default: 30s) and --timeout=DURATION (default: 30s), are not acceptable if you need to transfer a large database between nodes.

Such short timeouts cause Docker to start killing containers and stop the process on the donor side. In the end the whole cluster is down.

I've tested a new configuration:

HEALTHCHECK --interval=5m --timeout=3s CMD curl -f -o - http://localhost:8080/ || exit 1

and everything started working correctly.

I think the HEALTHCHECK CMD should also check whether the container is currently synchronizing with a donor.

colinmollenhour commented 7 years ago

I don't think that just bumping up the timeout is a good solution, since the right value depends on your dataset size: with a larger dataset it will still be broken. Also, a 5-minute interval means you could be down for 5 whole minutes before it is noticed, and what do you do as your database grows? And what if the node becomes a donor right toward the end of the 5 minutes?

Note that there are actually two health checks running; one of them is considered up while the node is a donor, so just switch the HEALTHCHECK command to use port 8081:

https://github.com/colinmollenhour/mariadb-galera-swarm/blob/master/start.sh#L215
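
In other words, something along these lines (the interval/timeout values are only placeholders; as noted above, 8081 is the check that stays up while the node is a donor, whereas 8080 only reports up once synced):

HEALTHCHECK --interval=30s --timeout=5s CMD curl -f -o - http://localhost:8081/ || exit 1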

bartoszx commented 7 years ago

Hello

I had the opportunity to examine the above problems with a 50GB database. One of my nodes crashed and I have a problem recovering it. The healthcheck command is (as you suggested earlier): "curl -f -o - http://localhost:8081/ || exit 1"

So I am expecting the healthcheck CMD to return something other than 503; additionally, I've enabled logging for the healthcheck to see what is going on.

Conclusions:

root@171d3aa89054:/# cat /tmp/8081
[2017-01-20 19:32:19.004121072 +0000 UTC] - dial tcp 127.0.0.1:3306: connection refused

Look at the logs:

---===--- MariaDB Galera Start Script ---===---
Got NODE_ADDRESS=10.0.0.93
Starting node, connecting to gcomm://10.0.0.148,10.0.0.9
mysqld.sh: uuid is not known...
mysqld.sh: --------------------------------------------------
mysqld.sh: Attempting to recover GTID positon...
2017-01-20 19:23:45 140445647488960 [Note] WSREP: Recovered position: eb9eeee8-de75-11e6-893d-8fd57da7b51a:26371
mysqld.sh: --------------------------------------------------
Galera Cluster Node status: synced
mysqld.sh: Node at 10.0.0.148 is healthy!
mysqld.sh: Found a healthy node! Attempting to join...
mysqld.sh: ----------------------------------
mysqld.sh: Starting with options: --console --wsrep-on=ON --wsrep_cluster_name=news --wsrep_cluster_address=gcomm://10.0.0.148,10.0.0.9 --wsrep_node_address=10.0.0.93:4567 --wsrep_sst_auth=xtrabackup:mypassword --default-time-zone=+00:00 --wsrep_start_position=eb9eeee8-de75-11e6-893d-8fd57da7b51a:26371
2017-01-20 19:23:47 140205943068608 [Note] mysqld (mysqld 10.1.20-MariaDB-1~jessie) starting as process 35 ...
2017-01-20 19:23:47 140205943068608 [Note] WSREP: Read nil XID from storage engines, skipping position init
2017-01-20 19:23:47 140205943068608 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2017-01-20 19:23:47 140205943068608 [Note] WSREP: wsrep_load(): Galera 25.3.19(r3667) by Codership Oy <info@codership.com> loaded successfully.
2017-01-20 19:23:47 140205943068608 [Note] WSREP: CRC-32C: using hardware acceleration.
2017-01-20 19:23:47 140205943068608 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootsrap: 1
2017-01-20 19:23:47 140205943068608 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 10.0.0.93; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.igno
2017-01-20 19:23:47 140205943068608 [Note] WSREP: GCache history reset: old(eb9eeee8-de75-11e6-893d-8fd57da7b51a:0) -> new(00000000-0000-0000-0000-000000000000:-1)
2017-01-20 19:23:47 140205943068608 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2017-01-20 19:23:47 140205943068608 [Note] WSREP: wsrep_sst_grab()
2017-01-20 19:23:47 140205943068608 [Note] WSREP: Start replication
2017-01-20 19:23:47 140205943068608 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2017-01-20 19:23:47 140205943068608 [Note] WSREP: protonet asio version 0
2017-01-20 19:23:47 140205943068608 [Note] WSREP: Using CRC-32C for message checksums.
2017-01-20 19:23:47 140205943068608 [Note] WSREP: backend: asio
2017-01-20 19:23:47 140205943068608 [Note] WSREP: gcomm thread scheduling priority set to other:0
2017-01-20 19:23:47 140205943068608 [Note] WSREP: restore pc from disk successfully
2017-01-20 19:23:47 140205943068608 [Note] WSREP: GMCast version 0
2017-01-20 19:23:47 140205943068608 [Note] WSREP: (dccf4702, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-01-20 19:23:47 140205943068608 [Note] WSREP: (dccf4702, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-01-20 19:23:47 140205943068608 [Note] WSREP: EVS version 0
2017-01-20 19:23:47 140205943068608 [Note] WSREP: gcomm: connecting to group 'news', peer '10.0.0.148:,10.0.0.9:'
2017-01-20 19:23:47 140205943068608 [Note] WSREP: (dccf4702, 'tcp://0.0.0.0:4567') connection established to 86dc7af6 tcp://10.0.0.148:4567
2017-01-20 19:23:47 140205943068608 [Note] WSREP: (dccf4702, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2017-01-20 19:23:47 140205943068608 [Note] WSREP: (dccf4702, 'tcp://0.0.0.0:4567') connection established to 951b7e4c tcp://10.0.0.9:4567
2017-01-20 19:23:47 140205943068608 [Note] WSREP: declaring 86dc7af6 at tcp://10.0.0.148:4567 stable
2017-01-20 19:23:47 140205943068608 [Note] WSREP: declaring 951b7e4c at tcp://10.0.0.9:4567 stable
2017-01-20 19:23:47 140205943068608 [Note] WSREP: Node 86dc7af6 state prim
2017-01-20 19:23:47 140205943068608 [Note] WSREP: view(view_id(PRIM,86dc7af6,263) memb {
    86dc7af6,0
    951b7e4c,0
    dccf4702,0
} joined {
} left {
} partitioned {
})
2017-01-20 19:23:47 140205943068608 [Note] WSREP: save pc into disk
2017-01-20 19:23:47 140205943068608 [Note] WSREP: clear restored view
2017-01-20 19:23:48 140205943068608 [Note] WSREP: gcomm: connected
2017-01-20 19:23:48 140205943068608 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-01-20 19:23:48 140205943068608 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2017-01-20 19:23:48 140205943068608 [Note] WSREP: Opened channel 'news'
2017-01-20 19:23:48 140205943068608 [Note] WSREP: Waiting for SST to complete.
2017-01-20 19:23:48 140205579298560 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2017-01-20 19:23:48 140205579298560 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2017-01-20 19:23:48 140205579298560 [Note] WSREP: STATE EXCHANGE: sent state msg: ebc0ede1-df45-11e6-b37e-eecc9ebc8311
2017-01-20 19:23:48 140205579298560 [Note] WSREP: STATE EXCHANGE: got state msg: ebc0ede1-df45-11e6-b37e-eecc9ebc8311 from 0 (b564cb7345ed)
2017-01-20 19:23:48 140205579298560 [Note] WSREP: STATE EXCHANGE: got state msg: ebc0ede1-df45-11e6-b37e-eecc9ebc8311 from 1 (2894be347102)
2017-01-20 19:23:48 140205579298560 [Note] WSREP: STATE EXCHANGE: got state msg: ebc0ede1-df45-11e6-b37e-eecc9ebc8311 from 2 (bb7c6231bb8c)
2017-01-20 19:23:48 140205579298560 [Note] WSREP: Quorum results:
    version    = 4,
    component  = PRIMARY,
    conf_id    = 28,
    members    = 2/3 (joined/total),
    act_id     = 26371,
    last_appl. = -1,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = eb9eeee8-de75-11e6-893d-8fd57da7b51a
2017-01-20 19:23:48 140205579298560 [Note] WSREP: Flow-control interval: [28, 28]
2017-01-20 19:23:48 140205579298560 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 26371)
2017-01-20 19:23:48 140205942754048 [Note] WSREP: State transfer required:
    Group state: eb9eeee8-de75-11e6-893d-8fd57da7b51a:26371
    Local state: 00000000-0000-0000-0000-000000000000:-1
2017-01-20 19:23:48 140205942754048 [Note] WSREP: New cluster view: global state: eb9eeee8-de75-11e6-893d-8fd57da7b51a:26371, view# 29: Primary, number of nodes: 3, my index: 2, protocol version 3
2017-01-20 19:23:48 140205942754048 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-01-20 19:23:48 140205549942528 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '10.0.0.93' --datadir '/var/lib/mysql/'   --parent '35'  '' '
WSREP_SST: [INFO] Streaming with tar (20170120 19:23:48.292)
WSREP_SST: [INFO] Using socat as streamer (20170120 19:23:48.294)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20170120 19:23:48.298)
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | tar xfi - --recursive-unlink -h; RC=( ${PIPESTATUS[@]} ) (20170120 19:23:48.307)
2017-01-20 19:23:48 140205942754048 [Note] WSREP: Prepared SST request: xtrabackup|10.0.0.93:4444/xtrabackup_sst
2017-01-20 19:23:48 140205942754048 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-01-20 19:23:48 140205942754048 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-01-20 19:23:48 140205942754048 [Note] WSREP: Assign initial position for certification: 26371, protocol version: 3
2017-01-20 19:23:48 140205637334784 [Note] WSREP: Service thread queue flushed.
2017-01-20 19:23:48 140205942754048 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (eb9eeee8-de75-11e6-893d-8fd57da7b51a): 1 (Operation not permitted)
     at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2017-01-20 19:23:48 140205579298560 [Note] WSREP: Member 2.0 (bb7c6231bb8c) requested state transfer from '*any*'. Selected 1.0 (2894be347102)(SYNCED) as donor.
2017-01-20 19:23:48 140205579298560 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 26371)
2017-01-20 19:23:48 140205942754048 [Note] WSREP: Requesting state transfer: success, donor: 1
2017-01-20 19:23:48 140205942754048 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(eb9eeee8-de75-11e6-893d-8fd57da7b51a:26371)
2017-01-20 19:23:51 140205587691264 [Note] WSREP: (dccf4702, 'tcp://0.0.0.0:4567') turning message relay requesting off
Received TERM|INT signal. Shutting down...
ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111 "Connection refused")
start.sh: Trapped error on line 198

colinmollenhour commented 7 years ago

Maybe the healthcheck should be updated to report "healthy" between the time mysqld is exec'd and when it joins successfully, but I'm not sure of the best way to do this. Perhaps the presence of a state file could help?

The /var/lib/mysql/gvwstate.dat file could be useful. I suspect that this file is missing during the state transfer, so something like the following could be used:

#!/bin/bash
# gvwstate.dat should appear once the node has joined the cluster
if test -f /var/lib/mysql/gvwstate.dat; then
    touch /tmp/joined.flag
fi
# After the first successful join, defer to the real healthcheck (port 8081)
if test -f /tmp/joined.flag; then
    curl -f -o - http://localhost:8081/ || exit 1
fi
exit 0 # Report healthy until joined once

I think you'd also need to add rm -f /tmp/joined.flag to mysqld.sh before the server starts, in case the gvwstate.dat file existed on startup and the script deleted it.

Please give that a shot and let me know how it goes.

Thanks!

bartoszx commented 7 years ago

Thanks for the reply. gvwstate.dat is always present, but I got your point. I've modified your script and it works perfectly:

root@LTX005:/home/users/bartek/mariadb-galera-swarm# cat healtcheck.sh
#!/bin/bash
if ! test -f /var/lib/mysql/sst_in_progress; then
    touch /tmp/joined.flag
fi
if test -f /tmp/joined.flag; then
    curl -f -o - http://localhost:8081/ || exit 1
fi
exit 0 # Report healthy until joined once
colinmollenhour commented 7 years ago

Ahh, perfect! I forgot about that file!

Do you care to submit a PR with the improved healthcheck?

Thanks!

bartoszx commented 7 years ago

Yes, sure. Let me do more tests and I will submit it tomorrow.

Thanks for help!

colinmollenhour commented 7 years ago

Since sst_in_progress is deleted when the SST finishes, it could probably be simplified further:

#!/bin/bash
test -f /var/lib/mysql/sst_in_progress || curl -f -o - http://localhost:8081/ || exit 1

Also, you might want to add rm -f /var/lib/mysql/sst_in_progress early in mysqld.sh, just in case the container was previously killed in the middle of an SST, although I'd guess that MySQL would delete this file itself pretty early on.
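
A minimal sketch of that cleanup (the exact placement in mysqld.sh is up to you):

# hypothetical: clear a stale SST marker left behind by a previously killed container
rm -f /var/lib/mysql/sst_in_progress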

bartoszx commented 7 years ago

An even simpler solution could be:

HEALTHCHECK CMD test -f /var/lib/mysql/sst_in_progress || curl -f -o - http://localhost:8080/ || exit 1

Or you don't have to change anything, because I can override the HEALTHCHECK CMD in docker-compose.yml.
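
If you go the override route, the same check can also be applied at the service level; a sketch, assuming a Docker version that supports the --health-* service flags (the service name is a placeholder):

docker service update \
  --health-cmd 'test -f /var/lib/mysql/sst_in_progress || curl -f -o - http://localhost:8080/ || exit 1' \
  --health-interval 30s \
  --health-timeout 5s \
  galera-node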

colinmollenhour commented 7 years ago

Ok, I've gone ahead and updated the default to use the sst_in_progress test since this is likely to be a recurring issue for other users as well. Thanks for your help, and I hope this container serves you well!

bartoszx commented 7 years ago

Hi Colin, let me reopen this issue because my problem still concerns the large database and the healthcheck status. After a few months I decided to add another node to the swarm. Docker created a new MySQL instance and started to sync data. During the sync period, Docker shows the container as healthy. Syncing took about 6 hours, and during that time roughly every fourth request returned "Connection refused". As long as the container has the healthy state, Docker will route traffic to it. How can I avoid this problem?

colinmollenhour commented 7 years ago

How are you routing traffic? The various healthchecks have different purposes in different cases. In one case the scheduler needs to know if the instance should be killed. I think for the HEALTHCHECK CMD this needs to work as it does so that the node is not killed. In other cases the system needs to know if traffic can be routed. Your routing system needs to be aware of the sync status and act accordingly. This is what the 8080 healthcheck is for, but how you make use of it depends on your system architecture. Currently I'm using HAProxy but there are other ways to do it. Unfortunately just using the swarm routing is not sufficient unless swarm adds a way to do separate healthchecks for "don't kill me" status vs "healthy".

bartoszx commented 7 years ago

That is the reason I am writing this :). I was wondering how you achieve that, because I saw no way to do it without additional software like HAProxy. Can you share your HAProxy configuration? P.S. In another, non-swarm project I was using MaxScale, but it was very CPU intensive.

colinmollenhour commented 7 years ago

Looks like Docker 17.09 will add support for a "start_period" option, which could be a good workaround. In that case you would make the healthcheck only report healthy if synced, and set the start_period long enough to allow a sync to complete. See https://github.com/docker/cli/pull/475
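
A sketch of that approach, assuming a Docker version with HEALTHCHECK --start-period support (the 6h value is only sized to the sync described above):

# report healthy only when synced (8080), but give a new node time to finish its SST first
HEALTHCHECK --start-period=6h --interval=30s --timeout=5s \
  CMD curl -f -o - http://localhost:8080/ || exit 1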

Regarding HAProxy, my config looks like this:

global
  log 127.0.0.1 local0 notice

defaults
  mode tcp
  log global
  timeout client     1h
  timeout server     1h
  timeout connect 1000

listen stats
  bind *:1936
  mode http
  stats enable
  stats uri /
  stats refresh 60s
  stats show-node
  stats show-legends

frontend galera-write-unix
  bind /run/galera-write.sock mode 777
  default_backend galera-write

frontend galera-write-tcp
  bind *:3306
  default_backend galera-write

frontend galera-read-unix
  bind /run/galera-read.sock mode 777
  default_backend galera-read

frontend galera-read-tcp
  bind *:3307
  default_backend galera-read

backend galera-write
  balance first
  option tcpka
  option httpchk GET / HTTP/1.0
  default-server port 8080 inter 1s downinter 1s rise 1 fall 1 maxconn 256 maxqueue 128 resolve-prefer ipv4
  {{HOSTS}}
  server node${num} ${ip}:3306 check
  {{/HOSTS}}

backend galera-read
  balance roundrobin
  option tcpka
  default-server inter 1s downinter 1s rise 1 fall 1 maxconn 256 maxqueue 128 resolve-prefer ipv4
  {{HOSTS}}
  server node${num} ${ip}:3306 track galera-write/node${num}
  {{/HOSTS}}

This provides two logical endpoints (each with both a TCP and a Unix-socket frontend): one that load-balances reads round-robin across all nodes, and one that sends everything to just the first node for transactions that are not cluster-aware.

The config has to be rendered by my "haproxy-service" container, which can be found here: https://github.com/colinmollenhour/haproxy-service. You could also do the same using confd, but since I'm using DNS the bash script does the job well without it.
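
For reference, the {{HOSTS}} block is rendered into concrete server lines, one per resolved node, something like the following (the IPs here are just examples taken from this thread):

  server node1 10.0.0.148:3306 check
  server node2 10.0.0.9:3306 check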

bartoszx commented 7 years ago

Colin, thanks for sharing your HAProxy config. I've started implementing the same with the help of Ansible. However, start_period for the healthcheck seems to be what I was looking for. I need to check its advantages and disadvantages compared to the HAProxy approach.

colinmollenhour commented 7 years ago

Why not use both? :)

If your app is already cluster-friendly on the writes, then you probably don't need HAProxy; but if you need all writes directed to the same node to ensure single-node-level transactions, then HAProxy is a pretty good solution.