autopilotpattern / mysql

Implementation of the autopilot pattern for MySQL
Mozilla Public License 2.0

Replication failure #71

Closed mbbender closed 7 years ago

mbbender commented 7 years ago

If I create a master database by itself

docker-compose up -d

Then I connect to this database and load in some initial content. I have two databases: one is about 0.5 GB, the other about 1 GB.

Then when I try to create a slave

docker-compose scale mysql=2

The replication fails with an error along the lines of "Can't create database ... database exists". I'm guessing this is a race condition between the initial backup and replication starting. I've reproduced this scenario twice in a row.

tgross commented 7 years ago

@mbbender the initial snapshot happens very quickly, so yes, if you're adding a new DB manually after launch it's unlikely to be in the first snapshot. The next snapshot will of course have the data.

You should be able to kick off another snapshot manually when you're shelled into the container via python /usr/local/bin/manage.py write_snapshot. This function is what the snapshot_task calls once it's determined that the BACKUP_TTL has expired.
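
Roughly, that task is just a TTL check wrapped around the same entry point; something like this sketch (the helper names here are illustrative, not the project's exact code):

    import os
    import time

    # assumed: the snapshot interval comes from the environment (treated as seconds here)
    BACKUP_TTL = int(os.environ['BACKUP_TTL'])

    def snapshot_task(node):
        """Write a new snapshot only once the previous one is older than BACKUP_TTL."""
        last_backup = node.consul_get('mysql-last-backup')  # assumed accessor for the Consul KV entry
        if last_backup is None or time.time() - float(last_backup) > BACKUP_TTL:
            node.write_snapshot()  # same entry point that manage.py write_snapshot invokes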

mbbender commented 7 years ago

That did execute the backup, but somehow the initial sync on the slave is getting halted due to the error:

Can't create database 'dbname'; database exists' on query. Default database: 'dbname'. Query: 'CREATE DATABASE `dbname` DEFAULT CHARACTER SET = `utf8` DEFAULT COLLATE = `utf8_bin`'

I started master, loaded content, manually took snapshot, verified it was in manta and keys looked right in consul, scaled to mysql=2 and this happens.

tgross commented 7 years ago

Interesting. The pre_start function has a fairly small scope -- it just gets the backup from Manta and then uses innobackupex in restore_from_snapshot. So there shouldn't be any way for another DB to exist at that point, because we have a fresh disk.
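
In rough pseudocode, the shape of it is (helper names and paths are illustrative, not the exact code):

    import os

    def pre_start(node):
        # nothing to restore if the data directory is already initialized
        if os.path.isdir('/var/lib/mysql/mysql'):
            return
        backup = node.get_backup_from_manta('/tmp/backup')  # assumed helper: fetch the latest snapshot
        node.restore_from_snapshot(backup)                   # unpacks it via innobackupex onto the fresh disk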

Can you provide a gist with the logs from the replica from the beginning up until that point? Preferably with DEBUG logging on.

mbbender commented 7 years ago

Starting with no backups in Manta and no keys in Consul... I'm running Consul from autopilotpattern/consul and have removed the consul node from the compose file in the mysql project.

Master Logs https://gist.github.com/mbbender/0e0ea435fec5924e5f64724192734545

Opened the master in Sequel Pro and imported two databases about 10 minutes after creating the master.

Ran python /usr/local/bin/manage.py write_snapshot on the master host to force a new backup, about 10 minutes after importing the last database. The output showed "completed OK!" and it appeared to execute without issue.

Created a slave with docker-compose scale mysql=2. Slave logs about 10 minutes after creation: https://gist.github.com/mbbender/9e8e0d9402488abd29cae4089449cbff

Slave SHOW SLAVE STATUS \G https://gist.github.com/mbbender/99a12de840a14bcb627b071b45a0102b

Master SHOW MASTER STATUS \G https://gist.github.com/mbbender/b7aefe827cf03fa37a0399900556b229

tgross commented 7 years ago

Slave logs about 10 minutes after creation

I don't see anything in here from the preStart steps. Actually, I don't see any of the manage.py logs in this replica's logs -- just the MySQL logs. But it looks like the DB is getting initialized here.

You're getting a lot of these at the end of the primary's logs; it doesn't look like your Consul agent is joining the Consul server successfully. Is the mysql-primary actually marked as healthy in Consul when you start the replica up? I suspect that's the underlying problem.

2016/12/07 16:20:10 Failed to join 192.168.129.93: received invalid msgType (72), expected pushPullMsg (6) from=192.168.129.93:8500
2016/12/07 16:20:10 Failed to join 192.168.129.94: received invalid msgType (72), expected pushPullMsg (6) from=192.168.129.94:8500
2016/12/07 16:20:10 * Failed to join 192.168.129.95: received invalid msgType (72), expected pushPullMsg (6) from=192.168.129.95:8500, retrying in 10s
2016/12/07 16:20:20     2016/12/07 16:20:20 [INFO] agent: (LAN) joining: [consul.svc.380b56fd-6e8c-406b-b4ae-4710ae9af147.us-east-1.cns.joyent.com:8500]
2016/12/07 16:20:20     2016/12/07 16:20:20 [INFO] agent: (LAN) joined: 0 Err: 3 error(s) occurred:

mbbender commented 7 years ago

Well, I don't know why the logs wouldn't be printed out. Isn't the default logging set to DEBUG?

I did notice those lines and was going to bring that up in a separate issue. Aren't those errors due to the fact that it's trying to gossip on port 8500? It looks like Consul expects to use the gossip protocol on port 8301 for LAN and 8302 for WAN; port 8500 is the HTTP API port.

Master looks healthy to me in consul. The service is mysql-primary with 2 passing health checks. The keys successfully identify the correct host as primary and all the log file keys and bin log look accurate.

mbbender commented 7 years ago

FYI, I took everything down and composed a new test from just the mysql project (not using an already-existing Consul raft). I get the same cluster-join errors in the logs.

I also noticed:

2016/12/07 19:28:54 * Failed to join 192.168.129.118: read tcp 192.168.129.117:62666->192.168.129.118:8500: i/o timeout
2016/12/07 19:28:54     2016/12/07 19:28:54 [ERROR] agent: max join retry exhausted, exiting
2016/12/07 19:28:54     2016/12/07 19:28:54 [INFO] agent: requesting shutdown
2016/12/07 19:28:54     2016/12/07 19:28:54 [INFO] consul: shutting down client
2016/12/07 19:28:54     2016/12/07 19:28:54 [WARN] serf: Shutdown without a Leave
2016/12/07 19:28:54     2016/12/07 19:28:54 [ERR] dns: error starting tcp server: accept tcp 127.0.0.1:8600: use of closed network connection
2016/12/07 19:28:54     2016/12/07 19:28:54 [INFO] manager: shutting down
2016/12/07 19:28:54     2016/12/07 19:28:54 [INFO] agent: shutdown complete
2016/12/07 19:28:54 coprocess[consul-agent (host:consul.svc.380b56fd-6e8c-406b-b4ae-4710ae9af147.us-east-1.cns.joyent.com)] exited (%!s(int=1)): exit status 1

Looks like the compose file doesn't expose 8600... not sure if this is desired or not.

mbbender commented 7 years ago

I do see that the mysql-backup-running key shows up in Consul and never goes away, in all the instances I've been testing.

mbbender commented 7 years ago

Should I be able to build this and use my image in place of autopilotpattern/mysql ?

Attaching to mbbmysql_mysql_1
Exception in thread Thread-3:
Traceback (most recent call last):
  File "threading.py", line 801, in __bootstrap_inner
  File "threading.py", line 754, in run
  File "compose/cli/log_printer.py", line 197, in watch_events
  File "compose/project.py", line 343, in events
  File "site-packages/docker/client.py", line 268, in _stream_helper
  File "site-packages/docker/client.py", line 178, in _result
  File "site-packages/docker/client.py", line 174, in _raise_for_status
APIError: 400 Client Error: Bad Request ("(NotImplemented) events is not implemented (bfdef350-bcb9-11e6-a492-e30019549cab)")

mbbmysql_mysql_1 exited with code 17

tgross commented 7 years ago

Isn't the default logging set to DEBUG?

Looks like no. It uses the value of the LOG_LEVEL env var but defaults to INFO if not set. So that's a documentation bug to fix.
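
In other words, something along these lines (a sketch, not the exact line in the repo):

    import logging
    import os

    # default to INFO unless LOG_LEVEL is set in the environment
    log_level = os.environ.get('LOG_LEVEL', 'INFO').upper()
    logging.basicConfig(level=getattr(logging, log_level, logging.INFO))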


Looks like the compose file doesn't expose 8600... not sure if this is desired or not.

We're not using the DNS port so that's fine. You've got an abnormal end there, so that's why you're seeing the cascade of errors after the agent halts. That's just Consul being a little "dirty" with the way it shuts down in that case.

You said earlier that you're using a separate Consul cluster... is it a different version from the agent inside the MySQL container?


Should I be able to build this and use my image in place of autopilotpattern/mysql ?

You're deploying on Triton, right? That error is because Triton doesn't implement docker events. Are you running docker-compose up without the -d flag, perhaps?

ref https://apidocs.joyent.com/docker/divergence#roadmap

docker events will not be implemented in the near future. This causes a trivial error when attempting to watch logs with docker-compose. Users who intend to use docker events for container orchestration purposes should look at ContainerPilot’s automatic service registration, health checking, and configuration.

mbbender commented 7 years ago

Tim, I'll set LOG_LEVEL to DEBUG and try again.

I was just using a pull of autopilotpattern/consul, then scaled it to 3. I did get the same issue when shutting everything down and just running this mysql project in isolation without an existing Consul cluster.

I was deploying without -d on Triton to try to find out why my built image wouldn't start.

mbbender commented 7 years ago

I'm getting the exact same behavior running all this locally. So if you just run this project locally or on Triton, it works as expected?

tgross commented 7 years ago

If I'm running the project unmodified, yes, it's working fine for me both locally and on Triton. So getting those debug logs and whatever changes made to the configurations would be really helpful in determining what's going wrong.

mbbender commented 7 years ago

Fresh clone. Ran ./setup.sh. Added my manta_bucket and manta_user configs to _env.

docker-compose up -d

mysql_mysql_1 container logs after a few minutes of running (I believe the Consul registration problem is in here, and perhaps unrelated to replication):
stdout: https://gist.github.com/mbbender/bc778f3e6f4f6a9d0d31d90962488900
stderr: https://gist.github.com/mbbender/a81ad3d10afd3c42e68f1799d49eaa57

Here are the Consul container's logs: https://gist.github.com/mbbender/a96a98408f374a967e4ce21c089258d0

mbbender commented 7 years ago

If I remove the :8500 from line 39 in https://github.com/autopilotpattern/mysql/blob/master/bin/manager/containerpilot.py, it seems to resolve some of the Consul issues. I start to be able to replicate data, but I'm still unable to import my databases and replicate. I think it may be a data type issue: the tables are utf8 / utf8_bin, and it seems that when I create a new table of this kind and try to generate a slave it causes errors, but if I use a latin1 encoding it doesn't. I don't know enough about why this would cause an issue, but I'm doing some more testing around that today if I can.
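
Roughly, the change amounts to rendering the flag from just the hostname, something like this sketch (I'm assuming the Consul address arrives via an env var; this isn't the exact code):

    import os

    # serf gossip has its own port, so don't append the 8500 HTTP API port to -retry-join
    consul_host = os.environ.get('CONSUL', 'consul').split(':')[0]
    retry_join_flag = '-retry-join={}'.format(consul_host)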

tgross commented 7 years ago

I think I've partially figured out the Consul problem -- when I upgraded this blueprint to use Consul 0.7 I used the autopilotpattern/consul instance, which isn't what we want for a single-node use case. And I've made sure it's obvious where to set the DEBUG logging. I've opened a work-in-progress PR here: https://github.com/autopilotpattern/mysql/pull/73

Now that I have Consul corrected and the debug logging on, I've done a scale-up and replication is still being set up as expected. The logs of a single health check in MySQL look like this:

2016/12/09 13:57:58 mysql-primary.health.RunWithTimeout start
2016/12/09 13:57:58 mysql-primary.health.Cmd.Start
2016/12/09 13:57:58 mysql-primary.health.run waiting for PID 16244:
2016/12/09 13:57:58     2016/12/09 13:57:58 [INFO] agent: (LAN) joining: [consul.svc.0f06a3e0-a0da-eb00-a7ae-989d4e44e2ad.us-east-1.cns.joyent.com:8500]
2016/12/09 13:57:58 DEBUG manage [health] node.health start
2016/12/09 13:57:58 DEBUG manage [health] node.assert_initialized_for_state start
2016/12/09 13:57:58 DEBUG manage [health] node.assert_initialized_for_state end: True
2016/12/09 13:57:58 DEBUG manage [health] node.is_primary start
2016/12/09 13:57:58 DEBUG manage [health] mysql.get_primary start
2016/12/09 13:57:58 DEBUG manage [health] mysql.query start
2016/12/09 13:57:58 DEBUG manage [health] mysql.wait_for_connection start
2016/12/09 13:57:58 DEBUG manage [health] mysql.wait_for_connection end
2016/12/09 13:57:58 DEBUG manage show slave status ()
2016/12/09 13:57:58 DEBUG manage [health] mysql.query end: []
2016/12/09 13:57:58 DEBUG manage [health] mysql.query start
2016/12/09 13:57:58 DEBUG manage show slave hosts ()
2016/12/09 13:57:58 DEBUG manage [health] mysql.query end: []
2016/12/09 13:57:58 DEBUG manage could not determine primary via mysqld status: no prior replication setup found
2016/12/09 13:57:58 DEBUG manage [health] consul.get_primary start
2016/12/09 13:57:58 DEBUG manage [{u'Node': {u'Node': u'04434aead2a3', u'TaggedAddresses': {u'wan': u'192.168.128.21', u'lan': u'192.168.128.21'}, u'ModifyIndex': 21, u'CreateIndex': 6, u'Address': u'192.168.128.21'}, u'Checks': [{u'Node': u'04434aead2a3', u'CheckID': u'mysql-primary-04434aead2a3', u'Name': u'mysql-primary-04434aead2a3', u'ServiceName': u'mysql-primary', u'Notes': u'TTL for mysql-primary set by containerpilot', u'ModifyIndex': 21, u'Status': u'passing', u'ServiceID': u'mysql-primary-04434aead2a3', u'Output': u'ok', u'CreateIndex': 20}, {u'Node': u'04434aead2a3', u'CheckID': u'serfHealth', u'Name': u'Serf Health Status', u'ServiceName': u'', u'Notes': u'', u'ModifyIndex': 6, u'Status': u'passing', u'ServiceID': u'', u'Output': u'Agent alive and reachable', u'CreateIndex': 6}], u'Service': {u'Service': u'mysql-primary', u'Tags': None, u'ModifyIndex': 21, u'EnableTagOverride': False, u'ID': u'mysql-primary-04434aead2a3', u'Address': u'192.168.128.21', u'CreateIndex': 20, u'Port': 3306}}]
2016/12/09 13:57:58 DEBUG manage [health] consul.get_primary end: (u'mysql-primary-04434aead2a3', u'192.168.128.21')
2016/12/09 13:57:58 DEBUG manage [health] node.is_primary end: True
2016/12/09 13:57:58 DEBUG manage [health] consul.renew_session start
2016/12/09 13:57:58 DEBUG manage [health] consul.get_session start
2016/12/09 13:57:58 DEBUG manage [health] consul.get_session end: 19359e7e-37de-4bba-5c12-0669e1501e61
2016/12/09 13:57:58 DEBUG manage [health] consul.renew_session end: None
2016/12/09 13:57:58 DEBUG manage [health] mysql.query start
2016/12/09 13:57:58 DEBUG manage select 1 ()
2016/12/09 13:57:58 DEBUG manage [health] mysql.query end: [{u'1': 1}]
2016/12/09 13:57:58 DEBUG manage [health] consul.unlock_failover start
2016/12/09 13:57:58 DEBUG manage [health] consul.unlock_failover end
2016/12/09 13:57:58 DEBUG manage [health] node.health end
2016/12/09 13:57:58 mysql-primary.health.run complete
2016/12/09 13:57:58 mysql-primary.health.RunWithTimeout end

Note the section where the Consul agent is trying to rejoin. So we haven't totally solved the problem of why the Consul agent membership is flaky but we've narrowed that part of the problem down a bit.

I start to be able to replicate data, but I'm still unable to import my databases and replicate. I think it may be a data type issue: the tables are utf8 / utf8_bin, and it seems that when I create a new table of this kind and try to generate a slave it causes errors, but if I use a latin1 encoding it doesn't.

That sounds like it could be a fruitful place to explore. The only place I can think of where we've set a charset is in establishing the connection: https://github.com/autopilotpattern/mysql/blob/93fdf1ba0ed16cefd4bbf5e102ce3586229b6efe/bin/manager/libmysql.py#L101
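
That is, roughly this (a sketch assuming a PyMySQL-style connect call; the values are placeholders):

    import pymysql

    # the manager only sets the charset on its own client connection;
    # the utf8/utf8_bin settings on your tables come from the dump itself
    conn = pymysql.connect(
        host='127.0.0.1',
        user='repluser',      # placeholder credentials
        password='secret',
        charset='utf8',
    )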

tgross commented 7 years ago

So we haven't totally solved the problem of why the Consul agent membership is flaky but we've narrowed that part of the problem down a bit.

Ok, I've figured that out now too. When we reload the config for the primary, the logic in manager/containerpilot.py includes the port 8500 in the -retry-join field, which it should not. So we join Consul just fine initially, but after the reload we keep trying to re-join on a port that doesn't work.

I've pushed a commit to https://github.com/autopilotpattern/mysql/pull/73 for that. I'm going to go thru and do a quick ContainerPilot update in the same PR to avoid a known but unrelated bug in 2.4.2.

tgross commented 7 years ago

I'm going to go thru and do a quick ContainerPilot update in the same PR to avoid a known but unrelated bug in 2.4.2.

Actually I'll do a separate PR for that.

tgross commented 7 years ago

@mbbender I've pushed a new autopilotpattern/mysql:latest image that includes both #73 and #74. Once those are merged I'll tag a release update for it. But try that so that we can eliminate the Consul stuff as a source of problems.

tgross commented 7 years ago

Released https://github.com/autopilotpattern/mysql/releases/tag/5.6r3.1.1 with those fixes. So now it's down to figuring out what you're seeing with the char types, perhaps. Can you try running with the new image and provide some details on how you're setting up the other tables?

mbbender commented 7 years ago

Hmm... locally, using this version, mysql now never registers with Consul at all. I'll try on Triton next. Here are my logs from the mysql container locally.

https://gist.github.com/mbbender/88ee2bd2722c68cb6a0cc2b226a99dcb

mbbender commented 7 years ago

Same on Triton. https://gist.github.com/mbbender/662baf8a595a3ab6cf6e561c1327f986

tgross commented 7 years ago

Ok, it looks like we have roughly the same sequence of events in both environments, although it's hard to tell because you don't have LOG_LEVEL=DEBUG on in the Triton deployment. Let's walk thru it:


In L528 we're re-writing the config to be the primary, as expected, and then reload the config in L532.

But in L536, within the same timestamp, our non-primary health check (which was already running) writes a success to Consul. That would be a problematic race, but we correct it after the reload is complete in L581, where we deregister the service before restarting the container in L588. So far so good.

Once we've restarted and the health checks start again we can see that the MySQL service advertisement is named mysql-primary in L620. The health check next tries to determine whether the node is primary. The logic for this is:
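
Roughly, it boils down to this (a sketch reconstructed from the DEBUG traces above; helper names are illustrative, not the exact code): ask mysqld first, and fall back to Consul if replication hasn't been set up yet.

    def is_primary(node):
        primary = node.get_primary_from_mysqld()      # SHOW SLAVE STATUS, then SHOW SLAVE HOSTS
        if not primary:
            primary = node.get_primary_from_consul()  # look for a passing mysql-primary instance
        if not primary:
            return False                              # "No primary found"
        primary_name, primary_ip = primary
        return primary_ip == node.ip                  # this node is primary if the reported address is its own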


The very strange thing is that I can definitely replicate this behavior with the latest build now, so I'm wondering whether I had a polluted build environment before. I'm also wondering whether updating the ContainerPilot with the fixed health checking revealed an existing bug.

tgross commented 7 years ago

I'm also wondering whether updating the ContainerPilot with the fixed health checking revealed an existing bug.

I can now confirm this is true. Crap! Ok, looks like I need to tackle that code path a bit more. Open to suggestions if you have any.

mbbender commented 7 years ago

Not at the moment. I was holding off on looking too deep into it until I heard from someone who knew what was going on that there was in fact an issue here and it wasn't just me... I'll try to take some time to look more into it later today or tomorrow, depending on how far you get with it before then.

tgross commented 7 years ago

I think this is the fix we need: https://github.com/autopilotpattern/mysql/pull/75. Alas, no. Still looking

mbbender commented 7 years ago

One local run produced the results described already. Another local run caused mysql-primary to show up, the FAILOVER_IN_PROGRESS key to be written, and mysql-primary to show as failing in Consul. I did rebuild the image locally between runs with --no-cache, but that shouldn't have made any difference. So perhaps there's some race condition producing the two different results. Logs: https://gist.github.com/mbbender/db5abf229d0320983eb28bcea4842b51

mbbender commented 7 years ago

I've also noticed an "Unrecognized character" warning in my logs. I doubt this is related to the Consul issue, but I just thought I'd call it out.

2016/12/13 02:13:38 DEBUG manage FLUSH PRIVILEGES; ()
2016-12-13 02:13:38 71 [Warning] 'proxies_priv' entry '@ root@66bf68ae0ae8' ignored in --skip-name-resolve mode.
2016/12/13 02:13:38 DEBUG manage [health] mysql.execute_many end
2016/12/13 02:13:38 DEBUG manage [health] node.write_snapshot start
2016/12/13 02:13:38 161213 02:13:38 innobackupex: Starting the backup operation
2016/12/13 02:13:38 
2016/12/13 02:13:38 IMPORTANT: Please check that the backup run completes successfully.
2016/12/13 02:13:38            At the end of a successful backup run innobackupex
2016/12/13 02:13:38            prints "completed OK!".
2016/12/13 02:13:38 
2016/12/13 02:13:38 Unrecognized character \x01; marked by <-- HERE after <-- HERE near column 1 at - line 1374.
2016/12/13 02:13:38 161213 02:13:38 Connecting to MySQL server host: localhost, user: repluser, password: set, port: 3306, socket: /var/run/mysqld/mysqld.sock
2016/12/13 02:13:38 Using server version 5.6.34-79.1-log
2016/12/13 02:13:38 /usr/bin/innobackupex version 2.3.6 based on MySQL server 5.6.24 Linux (x86_64) (revision id: 7686bfc)
2016/12/13 02:13:38 xtrabackup: uses posix_fadvise().
2016/12/13 02:13:38 xtrabackup: cd to /var/lib/mysql
2016/12/13 02:13:38 xtrabackup: open files limit requested 0, set to 1048576
2016/12/13 02:13:38 xtrabackup: using the following InnoDB configuration:
2016/12/13 02:13:38 xtrabackup:   innodb_data_home_dir = ./
2016/12/13 02:13:38 xtrabackup:   innodb_data_file_path = ibdata1:12M:autoextend
2016/12/13 02:13:38 xtrabackup:   innodb_log_group_home_dir = ./
2016/12/13 02:13:38 xtrabackup:   innodb_log_files_in_group = 2
2016/12/13 02:13:38 xtrabackup:   innodb_log_file_size = 50331648
2016/12/13 02:13:38 161213 02:13:38 >> log scanned up to (1632016)
2016/12/13 0

mbbender commented 7 years ago

@tgross Do you know if the TTL error that pops up matters?

Service not registered, registering...
2016/12/15 15:01:19     2016/12/15 15:01:19 [ERR] http: Request PUT /v1/agent/check/pass/mysql-f00f8c6d8526?note=ok, error: CheckID "mysql-f00f8c6d8526" does not have associated TTL from=127.0.0.1:36522

mbbender commented 7 years ago

Hey, just wondering if anyone has had a chance to look into this at all. I spent some time but can't seem to find any leads to chase.

tgross commented 7 years ago

I've been looking at this but not making a lot of progress, as I've got another item on my plate too. It looks to me like a logic problem in the state machine that determines whether an instance is primary or not, which was previously hidden by https://github.com/joyent/containerpilot/issues/223. Unfortunately I'm coming up on a holiday break here, so I'm not going to be able to continue to hunt this down until after the new year.

sodre commented 7 years ago

@tgross I am facing a similar issue, with the logic of determining whether an instance is primary or not. In particular, I was able to pinpoint it to commit a8fa551 using consul:0.7r0.7 and consul:0.7.2r0.7.2.

Since that build, a brand new system does not work in my on-prem Triton install. The docker logs get stuck printing the following over and over again.

mysql_1   | 2017/01/16 22:46:33 mysql-primary.health.RunWithTimeout end
mysql_1   | 2017/01/16 22:46:38 mysql-primary.health.RunWithTimeout start
mysql_1   | 2017/01/16 22:46:38 mysql-primary.health.Cmd.Start
mysql_1   | 2017/01/16 22:46:38 mysql-primary.health.run waiting for PID 31291: 
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] node.health start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] node.assert_initialized_for_state start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] node.assert_initialized_for_state end: True
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] node.is_primary start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.get_primary start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.query start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.wait_for_connection start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.wait_for_connection end
mysql_1   | 2017/01/16 22:46:38 DEBUG manage show slave status ()
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.query end: []
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.query start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage show slave hosts ()
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.query end: []
mysql_1   | 2017/01/16 22:46:38 DEBUG manage could not determine primary via mysqld status: no prior replication setup found
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] consul.get_primary start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage []
mysql_1   | 2017/01/16 22:46:38 DEBUG manage could not determine primary via Consul: No primary found
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] node.is_primary end: False
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] node.is_primary start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.get_primary start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.query start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage show slave status ()
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.query end: []
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.query start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage show slave hosts ()
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.query end: []
mysql_1   | 2017/01/16 22:46:38 DEBUG manage could not determine primary via mysqld status: no prior replication setup found
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] consul.get_primary start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage []
mysql_1   | 2017/01/16 22:46:38 DEBUG manage could not determine primary via Consul: No primary found
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] node.is_primary end: False
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.query start
mysql_1   | 2017/01/16 22:46:38 DEBUG manage show slave status ()
mysql_1   | 2017/01/16 22:46:38 DEBUG manage [health] mysql.query end: []
mysql_1   | 2017/01/16 22:46:38 ERROR manage Replica is not replicating.

I will try downgrading ContainerPilot to see if it works.

tgross commented 7 years ago

I'm going to get this all building and testing on our CI system so that I can really drill into the bug from there. Should have that started later today.

mbbender commented 7 years ago

Awesome. Thanks. I'll be on standby if you need me to test anything out on my end.

tgross commented 7 years ago

So the good news is that I have the integration test working when I run it locally against Triton Cloud. My CI system was busted somewhere in how I had its Manta credentials set up; got our CI creds fixed, so this is the output of that run: https://gist.github.com/tgross/644854cc1e35727d8c70b5fe23a034fd

The bad news is of course that I have no idea why we're seeing this problem still. ☹️

----------------------------------------------------------------------
MySQLStackTest.test_replication_and_failover
----------------------------------------------------------------------
elapsed  | task
0.756753 | docker-compose -f docker-compose.yml -p my stop
0.661451 | docker-compose -f docker-compose.yml -p my rm -f
138.9547 | docker-compose -f docker-compose.yml -p my up -d
2.281949 | docker-compose -f docker-compose.yml -p my ps
0.718240 | docker inspect my_consul_1
51.99171 | wait_for_service: mysql-primary 1
2.389175 | docker-compose -f docker-compose.yml -p my ps -q mysql
2.982185 | docker exec 3aa7d2997c68c540af20a4b4e666a3051c9f556ede61e602d36b9ba34f94a2fd ip -o addr
5.409135 | assert_consul_correctness:
63.74707 | docker-compose -f docker-compose.yml -p my scale mysql=3
6.165413 | wait_for_service: mysql 2
1.828162 | docker-compose -f docker-compose.yml -p my ps -q mysql
2.492469 | docker exec 3aa7d2997c68c540af20a4b4e666a3051c9f556ede61e602d36b9ba34f94a2fd ip -o addr
1.612487 | docker exec 2ef369b637e3c7f5d3e6d979ec421f1dbca90510db66e344d018983fc6a84232 ip -o addr
2.551501 | docker exec 7ea5bfe0089e44aaa925f5917710f0b44f2f6d8e610fc2458064ee88f485ffde ip -o addr
8.570330 | assert_consul_correctness:
2.596569 | docker exec my_mysql_1 mysql -u mytestuser -pVD7MBOYOHL --vertical -e CREATE TABLE tbl1 (field1 INT, field2 VARCHAR(36)); mytestdb
2.345644 | docker exec my_mysql_1 mysql -u mytestuser -pVD7MBOYOHL --vertical -e INSERT INTO tbl1 (field1, field2) VALUES (1, "a7b3e4db-0828-475d-a975-e74679069f42"); mytestdb
2.460741 | docker exec my_mysql_1 mysql -u mytestuser -pVD7MBOYOHL --vertical -e INSERT INTO tbl1 (field1, field2) VALUES (1, "3c0bcb6d-4da4-4bc3-b9ba-92cea798c15e"); mytestdb
1.679266 | docker exec 2ef369b637e3 mysql -u mytestuser -pVD7MBOYOHL --vertical -e SELECT * FROM tbl1 WHERE `field1`=1; mytestdb
2.479666 | docker exec 7ea5bfe0089e mysql -u mytestuser -pVD7MBOYOHL --vertical -e SELECT * FROM tbl1 WHERE `field1`=1; mytestdb
10.56943 | docker stop my_mysql_1
0.019927 | wait_for_service: mysql-primary 1
1.792100 | docker-compose -f docker-compose.yml -p my ps -q mysql
1.667353 | docker exec 3aa7d2997c68c540af20a4b4e666a3051c9f556ede61e602d36b9ba34f94a2fd ip -o addr
1.573542 | docker exec 2ef369b637e3c7f5d3e6d979ec421f1dbca90510db66e344d018983fc6a84232 ip -o addr
2.540239 | docker exec 7ea5bfe0089e44aaa925f5917710f0b44f2f6d8e610fc2458064ee88f485ffde ip -o addr
7.613789 | assert_consul_correctness:
0.026989 | wait_for_service: mysql 1
1.474512 | docker-compose -f docker-compose.yml -p my ps -q mysql
1.679708 | docker exec 3aa7d2997c68c540af20a4b4e666a3051c9f556ede61e602d36b9ba34f94a2fd ip -o addr
1.533755 | docker exec 2ef369b637e3c7f5d3e6d979ec421f1dbca90510db66e344d018983fc6a84232 ip -o addr
2.392419 | docker exec 7ea5bfe0089e44aaa925f5917710f0b44f2f6d8e610fc2458064ee88f485ffde ip -o addr
7.120464 | assert_consul_correctness:
1.731336 | docker exec 2ef369b637e3 mysql -u mytestuser -pVD7MBOYOHL --vertical -e INSERT INTO tbl1 (field1, field2) VALUES (1, "dede9f82-4cac-4bf7-b760-76e33c04a612"); mytestdb
2.503307 | docker exec 7ea5bfe0089e mysql -u mytestuser -pVD7MBOYOHL --vertical -e SELECT * FROM tbl1 WHERE `field1`=1; mytestdb
.generating _env

----------------------------------------------------------------------
Ran 1 test in 319.778s

OK

tgross commented 7 years ago

Just FYI for those following this issue, I've just pushed this build to the Docker Hub as autopilotpattern/mysql:latest. Still trying to see if I can reproduce the original problem with this build.

tgross commented 7 years ago

I'm getting intermittent failures in our CI for the following during preStart:

2017/01/31 18:14:59 requests.exceptions.ConnectionError: HTTPConnectionPool(host='mysql-consul.svc.0f06a3e0-a0da-eb00-a7ae-989d4e44e2ad.us-sw-1.cns.joyent.com', port=8500): Max retries exceeded with url: /v1/kv/mysql-last-backup (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7ffffc887ad0>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Which suggests that the timeouts on the preStart are not long enough.

Fixed this in https://github.com/autopilotpattern/mysql/commit/f8abadad7e8c10a5d09abbbadfcd3549d80b5319
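
The shape of the guard is roughly this (a sketch, not the actual change): keep retrying the Consul request for a bounded window instead of failing on the first DNS miss.

    import time
    import requests

    def wait_for_consul(url, timeout=300, interval=3):
        # retry until the Consul endpoint answers or the (assumed) timeout expires
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                requests.get(url, timeout=5).raise_for_status()
                return True
            except requests.exceptions.RequestException:
                time.sleep(interval)
        return False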

mbbender commented 7 years ago

Did the timeouts solve the issue? I'll try to give it a shot on my side today or tomorrow.

tgross commented 7 years ago

I believe so; I haven't been able to replicate the problem since the last round of updates.

mbbender commented 7 years ago

Finally got to testing this. Looks like everything is healthy locally. I'll do a little more intensive work on it and try to get it running on Triton before closing this out. Thanks.

mbbender commented 7 years ago

EDIT: I'm going to open this as a new issue. I think replication is working as expected.

It's definitely getting past where it was before. I got the mysql primary to come up fine and saw replication working. However, when I killed the master to test failover, on the slave server I get:

Version: '5.6.34-79.1-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Percona Server (GPL), Release 79.1, Revision 1c589f9
2017/03/12 22:58:54 INFO manage Setting up replication.
2017-03-12 22:58:54 58858 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysqld-relay-bin' to avoid this problem.
2017-03-12 22:58:54 58858 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='192.168.129.171', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2017-03-12 22:58:54 58858 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2017-03-12 22:58:54 58858 [Warning] Slave SQL: If a crash happens this configuration does not guarantee that the relay log info will be consistent, Error_code: 0
2017-03-12 22:58:54 58858 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './mysqld-relay-bin.000001' position: 4
2017-03-12 22:58:54 58858 [Note] Slave I/O thread: connected to master 'repluser@192.168.129.171:3306',replication started in log 'FIRST' at position 4
2017/03/12 22:58:54     2017/03/12 22:58:54 [ERR] http: Request PUT /v1/agent/check/pass/mysql-3709f21efad3?note=ok, error: CheckID "mysql-3709f21efad3" does not have associated TTL from=127.0.0.1:45860
2017/03/12 22:58:54 Unexpected response code: 500 (CheckID "mysql-3709f21efad3" does not have associated TTL)
Service not registered, registering...
2017/03/12 22:58:54     2017/03/12 22:58:54 [INFO] agent: Synced service 'mysql-3709f21efad3'
2017/03/12 22:58:54     2017/03/12 22:58:54 [INFO] agent: Synced check 'mysql-3709f21efad3'
2017/03/12 22:58:54     2017/03/12 22:58:54 [INFO] agent: Synced check 'mysql-3709f21efad3'
2017/03/12 23:04:39 INFO manage [on_change] Executing failover with candidates: [u'192.168.129.173']
2017-03-12 23:04:39 58858 [Note] Error reading relay log event: slave SQL thread was killed
2017-03-12 23:04:39 58858 [Note] Slave I/O thread killed while reading event
2017-03-12 23:04:39 58858 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000001', position 1209
2017/03/12 23:04:39 WARNING: Using a password on the command line interface can be insecure.
2017/03/12 23:04:39 # Checking privileges.
2017/03/12 23:04:39 # Checking privileges on candidates.
2017/03/12 23:04:39 # Performing failover.
2017/03/12 23:04:39 # Candidate slave 192.168.129.173:3306 will become the new master.
2017/03/12 23:04:39 # Checking slaves status (before failover).
2017/03/12 23:04:39 # Preparing candidate for failover.
2017/03/12 23:04:39 # Creating replication user if it does not exist.
2017/03/12 23:04:39 # Stopping slaves.
2017/03/12 23:04:39 # Performing STOP on all slaves.
2017/03/12 23:04:39 # Switching slaves to new master.
2017/03/12 23:04:39 # Disconnecting new master as slave.
2017/03/12 23:04:39 # Starting slaves.
2017/03/12 23:04:39 # Performing START on all slaves.
2017/03/12 23:04:39 # Checking slaves for errors.
2017/03/12 23:04:39 # Failover complete.
2017/03/12 23:04:39 #
2017/03/12 23:04:39 # Replication Topology Health:
2017/03/12 23:04:39 +------------------+-------+---------+--------+------------+---------+
2017/03/12 23:04:39 | host             | port  | role    | state  | gtid_mode  | health  |
2017/03/12 23:04:39 +------------------+-------+---------+--------+------------+---------+
2017/03/12 23:04:39 | 192.168.129.173  | 3306  | MASTER  | UP     | ON         | OK      |
2017/03/12 23:04:39 +------------------+-------+---------+--------+------------+---------+
2017/03/12 23:04:39 # ...done.
2017/03/12 23:04:40 ERROR manage [on_change] this node is neither primary or replica after failover; check replication status on cluster.
2017/03/12 23:04:44 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:04:45     2017/03/12 23:04:45 [INFO] memberlist: Suspect a5b06f708612 has failed, no acks received
2017/03/12 23:04:48     2017/03/12 23:04:48 [INFO] memberlist: Suspect a5b06f708612 has failed, no acks received
2017/03/12 23:04:49 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:04:50     2017/03/12 23:04:50 [INFO] serf: EventMemberFailed: a5b06f708612 192.168.129.171
2017/03/12 23:04:50     2017/03/12 23:04:50 [INFO] memberlist: Suspect a5b06f708612 has failed, no acks received
2017/03/12 23:04:54 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:04:59 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:04 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:04     2017/03/12 23:05:04 [WARN] agent: Check 'mysql-3709f21efad3' missed TTL, is now critical
2017/03/12 23:05:04     2017/03/12 23:05:04 [INFO] agent: Synced check 'mysql-3709f21efad3'
2017/03/12 23:05:09 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:14 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:19 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:24 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:29 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:34 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:39 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:44 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:48     2017/03/12 23:05:48 [INFO] serf: attempting reconnect to a5b06f708612 192.168.129.171:8301
2017/03/12 23:05:49 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:54 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:59 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:04 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:09 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:14 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:19 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:24 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:29 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:34 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:39 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:44 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:49 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:54 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:58     2017/03/12 23:06:58 [INFO] serf: attempting reconnect to a5b06f708612 192.168.129.171:8301
2017/03/12 23:06:59 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:04 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:09 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:14 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:19 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:24 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:29 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:34 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:38     2017/03/12 23:07:38 [INFO] serf: attempting reconnect to a5b06f708612 192.168.129.171:8301
2017/03/12 23:07:39 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:44 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:49 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:54 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:59 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:04 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:09 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:14 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:19 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:24 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:29 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:34 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:39 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:44 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:49 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:54 ERROR manage Cannot determine MySQL state; failing health check.