default value at false for disable_auto_sst and startup_timeout

ldangeard-orange commented 6 years ago

Hello, with the new develop version 36.10 (dev), disable_auto_sst defaults to FALSE.

However, when you have a big database, the copy with Xtrabackup (SST) needs more than 60 seconds (cf_mysql.mysql.startup_timeout is 60 by default). So monit tries to restart the database through mariadb_ctrl while the transfer is not finished. This produces many error messages:

2018-01-16 7:24:59 140202241787776 [ERROR] Could not open mysql.plugin table. Some plugins may be not loaded
2018-01-16 7:24:59 140202240195328 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1146: Table 'mysql.gtid_slave_pos' doesn't exist
2018-01-16 7:24:59 140202241787776 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
2018-01-16 7:24:59 140202241787776 [ERROR] Fatal error: Can't open and lock privilege tables: Table 'mysql.user' doesn't exist

So I think it's better to block automatic SST, because you need to analyse why your instance is desynchronized.

If you want to keep the default value FALSE for disable_auto_sst, you need to:

- increase startup_timeout
- monitor whether the instance is executing an SST, for example with wsrep_cluster_conf_id: http://galeracluster.com/documentation-webpages/monitoringthecluster.html
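A minimal sketch of such a monitoring check, in Python. It assumes the rows of SHOW GLOBAL STATUS LIKE 'wsrep_%' have already been fetched into a dict of strings; the function name node_findings is hypothetical and not part of this release. It compares wsrep_cluster_conf_id between two polls and looks at wsrep_local_state_comment, which reports Synced on a caught-up node.

```python
def node_findings(previous, current):
    """Compare two wsrep status snapshots and list anything worth alerting on.

    Each snapshot is a dict (variable name -> value as string), assumed to be
    built from the rows of SHOW GLOBAL STATUS LIKE 'wsrep_%'.
    """
    findings = []

    # wsrep_cluster_conf_id changes whenever cluster membership changes,
    # for example when a node leaves or rejoins.
    if int(current["wsrep_cluster_conf_id"]) != int(previous["wsrep_cluster_conf_id"]):
        findings.append("cluster membership changed")

    # A node that is caught up reports 'Synced'; any other value
    # (for example 'Joining' or 'Donor/Desynced') deserves a look.
    state = current.get("wsrep_local_state_comment", "unknown")
    if state != "Synced":
        findings.append("node state is " + state)

    return findings
```

An empty list means nothing changed between the two polls and the node is synced. How the snapshots are collected and where the alert goes is left to whatever monitoring you already run.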

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/154352380

The labels on this github issue will be updated when the story is started.

GETandSELECT commented 6 years ago

hello @ldangeard-orange

I am curious what's the DB size in your env?

ldangeard-orange commented 6 years ago

Hello @GETandSELECT, my test DB is 15 GB, but we wish to grow it to 50 GB.

ldangeard-orange commented 6 years ago

Hello, I tested with cf_mysql.mysql.startup_timeout=600, and there were many problems with monit when we stop one node, etc.

So it's not a good idea to have disable_auto_sst set to false.
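For reference, here is a rough way to estimate how long an SST would need for a given database size, as a Python sketch. The throughput is an assumption you have to measure on your own network and disks, and the function name is hypothetical. It only covers the raw copy time, not the prepare and move-back phases.

```python
import math


def sst_copy_seconds(db_bytes, throughput_bytes_per_sec):
    """Estimate the raw copy time of an SST, rounded up to whole seconds.

    throughput_bytes_per_sec is an assumed, measured value; it is not
    something this release reports.
    """
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return math.ceil(db_bytes / throughput_bytes_per_sec)


def fits_startup_timeout(db_bytes, throughput_bytes_per_sec, startup_timeout_sec):
    """True if the estimated copy time stays within the startup timeout."""
    return sst_copy_seconds(db_bytes, throughput_bytes_per_sec) <= startup_timeout_sec
```

With an assumed 100 MB/s, a 15 GB database already needs more than the default 60 seconds for the copy alone, which matches what I observe.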

menicosia commented 6 years ago

@ldangeard-orange I'm a little confused. SST should only happen during BOSH pre-start phase, which is not governed by monit. Are you restarting VMs outside of the bosh director's control? It's very important to stick to using BOSH to manage the VMs, else the pre-start phase will not run.

Am I understanding the problem, and explaining this correctly?

ldangeard-orange commented 6 years ago

Hello @menicosia, there are several cases:

1) When you restore MariaDB with SHIELD on a node, you need to stop the other nodes with a monit stop mariadb_ctrl. Afterwards, you execute monit start mariadb_ctrl, and Galera runs an SST.
2) When the activity is intense and the Galera gcache is too small, a node becomes desynchronized and Galera goes into SST.

ctaymor commented 6 years ago

Hi @ldangeard-orange,

In this case, you'll want to run the pre-start for the mariadb_ctrl job before the job starts so the node is able to properly perform the SST. The best way to perform this is probably to bosh stop the individual mysql nodes, perform the restore, then bosh start the nodes again. This will ensure that the pre-start scripts are run, and should give plenty of time for the SST to run. If you use monit to stop and start, you'll have to run the pre-start script manually.
Yeah, in this case you probably want to increase the gcache_size to a value that is large enough that you don't run into this problem.
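As a rough way to pick that value, the gcache has to hold everything the cluster writes while a node is away, so a lower bound is the replication write rate multiplied by the longest outage you want to survive without an SST. A small Python sketch of that arithmetic follows; the function name and the safety factor are my own assumptions, not settings of this release.

```python
import math


def min_gcache_bytes(write_bytes_per_sec, max_outage_sec, safety_factor=1.5):
    """Lower bound for gcache_size, in bytes.

    write_bytes_per_sec: measured replication write rate (an assumed input).
    max_outage_sec: longest node absence that should still allow an IST.
    safety_factor: headroom for bursts; 1.5 is an arbitrary illustrative value.
    """
    if write_bytes_per_sec < 0 or max_outage_sec < 0 or safety_factor < 1:
        raise ValueError("inputs must be non-negative and safety_factor >= 1")
    return math.ceil(write_bytes_per_sec * max_outage_sec * safety_factor)
```

The write rate has to come from your own measurements under peak load, since a benchmark-style burst is exactly when the gcache overflows.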

bgandon commented 6 years ago

Hi Marco (@menicosia) and Caroline (@ctaymor),

I'm jumping into this issue in order to clarify things. I'm working with Laurent (@ldangeard-orange) in the Orange FR database experts team. The people here have very strong expertise in production data services. As a contractor and BOSH expert in France, I'm helping them pivot towards authoring BOSH releases and recommendations that benefit from or encapsulate their expertise. Currently, we focus on MongoDB, Cassandra and MariaDB (with this cf-mysql-release).

In this issue, the situation that Laurent describes is the following:

1. A Galera node is lagging behind the other nodes. This can be a consequence of a Gcache (Galera cache) that is too small, or of very high activity on a node that receives and initiates all transactions (this is typically the node that is targeted by the switchboard proxies). Or it can be any reason that is documented in the Troubleshooting and Diagnostics chapter of the PCF tile v1.10 documentation.
2. Now that disable_auto_sst is false, this lagging node naturally starts an SST.
3. The SST stops the MariaDB daemon and starts another process, in such a way that the PID file tracked by Monit no longer refers to any live process.
4. The SST moves all the content of /var/vcap/store/mysql (the MySQL datadir) into a .sst subdirectory, so that /var/vcap/store/mysql is nearly emptied.
5. The SST takes more than 60 seconds because the database is big.
6. As the monit timeout is 60 seconds, monit jumps in and says “Hey, this daemon is down whereas it should not be! Let's start it again!” But in fact, Monit should keep quiet and wait for the SST to finish.
7. Monit messes things up, because it creates a fresh new database in /var/vcap/store/mysql whereas this directory should stay empty. And ironically, Monit doesn't even succeed in this, because MariaDB is missing some files that have been moved to the .sst subdirectory (see the error messages pasted by Laurent in his very first post in this thread). So Monit blindly retries and fails several times at restarting MariaDB.
8. The SST finishes and all the content of the .sst subdirectory is copied back to /var/vcap/store/mysql. But this destination directory is no longer empty, so the SST stops with an error because it is not supposed to clobber any existing database file (the error says something like “datadir is not empty”).
9. The MariaDB node is crashed for good.
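To illustrate the kind of guard that is missing in this scenario, here is a minimal Python sketch. It assumes, as described above, that an in-flight SST leaves a .sst subdirectory in the datadir; the function names are hypothetical and nothing like this exists in the release today.

```python
import os


def sst_seems_in_progress(datadir):
    """Return True if the datadir contains a '.sst' subdirectory.

    This relies on the behaviour described in this thread: the SST moves the
    datadir content into '<datadir>/.sst' while the transfer is running.
    """
    return os.path.isdir(os.path.join(datadir, ".sst"))


def safe_to_start_mariadb(datadir):
    """A start wrapper would refuse to launch MariaDB during an SST."""
    return not sst_seems_in_progress(datadir)
```

A start script could call safe_to_start_mariadb("/var/vcap/store/mysql") and exit without touching the datadir when it returns False. This only avoids the clobbering; it does not stop Monit from reporting the process as down.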

In our case, step 1 was triggered by nodes rejoining a cluster after a TPCC benchmark. Indeed, we had monit-stopped 2 out of 3 nodes before running a TPCC benchmark on the remaining node. When the 2 other nodes rejoined the cluster with a monit start, the SST was triggered and the failure scenario happened.

So, it's correct that stopping nodes with bosh stop and restarting them with bosh start is better and might work. But the point here is that the SST situation could be triggered naturally in a loaded cluster, as described above. And we have seen this happening in production once in a while.

Normally, Monit should not try to restart the daemon while the system is doing an SST. Maybe the SST script should write its PID into the PID file that Monit is tracking, so that Monit is happy with a live process. But this might require some PRs to be pushed in Galera so that the SST writes its PID in a specific PID file.

What Laurent says is that, until such changes happen, it would be safer to move back to disable_auto_sst: true. Indeed, when an SST is required, the node just stops with a specific log line (mentioned in the Interruptor section of the PCF tile v1.10 documentation) that seems to have been added to the SST script specifically. Then an operator needs to be alerted about this log line and run an SST manually at the proper time, considering production constraints.

For a long-term solution, assuming that this log line is the result of a customized SST script, why not have this script write its own PID in the proper file so that Monit stays happy? This is just a guess. Now that I have (hopefully) clarified the issue, I'll let you jump in and suggest any fix you find most relevant.

As a conclusion, I hope this will help in solving this issue, which is a concern for anyone relying on the default values proposed by this BOSH release.

Best, Benjamin

cloudfoundry / cf-mysql-release

default value at false for disable_auto_sst and startup_timeout #195