cloudfoundry-attic / consul-release

This is a BOSH release for consul.
Apache License 2.0

Consul server didn't recover when clients were also down #27

Closed: youngm closed this issue 8 years ago

youngm commented 8 years ago

We experienced a partial datacenter outage recently, and while bringing things back up I discovered that I couldn't bring one of my 3 consul servers back up. From the logs, it appears the server couldn't come up because some of the clients were down.

Once all the clients recovered then the server was able to start.

I haven't yet attempted to reproduce the issue. If I find the time I will update the issue with exact steps. I imagine I could reproduce it by disconnecting some clients from the network and then crashing a node of a 3-node cluster.

Here are the applicable logs:

{"timestamp":"1466775061.845097780","source":"confab","message":"confab.agent-client.is-last-node.result","log_level":1,"data":{"actual_members_count":3,"expected_members_count":3,"is_last_node":true}}
{"timestamp":"1466775061.845248222","source":"confab","message":"confab.controller.configure-server.verify-synced","log_level":1,"data":{}}
{"timestamp":"1466775061.845463514","source":"confab","message":"confab.agent-client.verify-synced.stats.request","log_level":1,"data":{}}
{"timestamp":"1466775061.845934629","source":"confab","message":"confab.agent-client.verify-synced.stats.response","log_level":1,"data":{"commit_index":"362812","last_log_index":"362812"}}
{"timestamp":"1466775061.846092939","source":"confab","message":"confab.agent-client.verify-synced.synced","log_level":1,"data":{}}
{"timestamp":"1466775061.846226931","source":"confab","message":"confab.controller.configure-server.set-keys","log_level":1,"data":{"keys":["--deleted--"]}}
{"timestamp":"1466775061.846373320","source":"confab","message":"confab.agent-client.set-keys.list-keys.request","log_level":1,"data":{}}
    2016/06/24 13:31:01 [INFO] serf: Received list-keys query
    2016/06/24 13:31:01 [INFO] serf: Received list-keys query
{"timestamp":"1466775066.457789183","source":"confab","message":"confab.agent-client.set-keys.list-keys.response","log_level":1,"data":{"keys":["--deleted--"]}}
{"timestamp":"1466775066.457836866","source":"confab","message":"confab.agent-client.set-keys.install-key.request","log_level":1,"data":{"key":"--deleted--"}}
    2016/06/24 13:31:06 [INFO] serf: Received install-key query
    2016/06/24 13:31:06 [INFO] serf: Received install-key query
{"timestamp":"1466775070.503915071","source":"confab","message":"confab.agent-client.set-keys.install-key.request.failed","log_level":2,"data":{"error":"1/108 nodes reported failure","key":"--deleted--"}}
{"timestamp":"1466775070.503953934","source":"confab","message":"confab.controller.configure-server.set-keys.failed","log_level":2,"data":{"error":"1/108 nodes reported failure","keys":["--deleted--"]}}
==> /var/vcap/sys/log/consul_agent/consul_agent.stderr.log <==
error during start: 1/108 nodes reported failure
==> /var/vcap/sys/log/consul_agent/consul_agent.stdout.log <==
    2016/06/24 13:31:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:37087
{"timestamp":"1466775070.505033970","source":"confab","message":"confab.controller.stop-agent.leave","log_level":1,"data":{}}
{"timestamp":"1466775070.505050182","source":"confab","message":"confab.agent-client.leave.leave.request","log_level":1,"data":{}}
    2016/06/24 13:31:10 [INFO] agent.rpc: Graceful leave triggered
    2016/06/24 13:31:10 [INFO] consul: server starting leave
    2016/06/24 13:31:10 [INFO] serf: EventMemberLeave: etcd-pz2-0.dc1 --deleted--
    2016/06/24 13:31:10 [INFO] consul: removing server etcd-pz2-0.dc1 (Addr: --deleted--:8300) (DC: dc1)
    2016/06/24 13:31:11 [INFO] serf: EventMemberLeave: etcd-pz2-0 10.61.137.53
    2016/06/24 13:31:11 [INFO] consul: removing server etcd-pz2-0 (Addr: --deleted--:8300) (DC: dc1)
    2016/06/24 13:31:11 [INFO] raft: Removed ourself, transitioning to follower
{"timestamp":"1466775072.029870272","source":"confab","message":"confab.agent-client.leave.leave.response","log_level":1,"data":{}}
{"timestamp":"1466775072.029910088","source":"confab","message":"confab.controller.stop-agent.wait","log_level":1,"data":{}}
{"timestamp":"1466775072.029917955","source":"confab","message":"confab.agent-runner.wait.get-process","log_level":1,"data":{}}
{"timestamp":"1466775072.029926062","source":"confab","message":"confab.agent-runner.wait.get-process.result","log_level":1,"data":{"pid":5852}}
{"timestamp":"1466775072.029937506","source":"confab","message":"confab.agent-runner.wait.signal","log_level":1,"data":{"pid":5852}}
    2016/06/24 13:31:12 [INFO] agent: requesting shutdown
    2016/06/24 13:31:12 [INFO] consul: shutting down server
    2016/06/24 13:31:12 [INFO] agent: shutdown complete
cf-gitbot commented 8 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/122215983

The labels on this github issue will be updated when the story is started.

christianang commented 8 years ago

Hi @youngm,

Can you give more details about what happened during this partial datacenter outage? Did all the VMs with consul agents go down? Did the entire consul cluster go down?

When you say all the clients recovered, did they recover by you redeploying those clients?

I'm also curious as to which node reported a failure. How many clients were down when you tried to start the cluster? If any were down, can you get relevant logs from their consul agents?

From just these logs it is hard to tell why it failed.

Thanks, Christian and @kkallday

youngm commented 8 years ago

@christianang @kkallday:

What happened during this outage:

About 1/4 of our servers' filesystems switched to read-only mode, causing all kinds of problems.

Did all the VMs with consul agents go down?

We have a 3-node server cluster with about 100 agents. One server was affected (the one I couldn't bring up) and about 1/4 of the agents were also affected. The cluster stayed up since I still had 2 servers running.

When you say all the clients recovered, did they recover by you redeploying those clients?

The clients recovered fine after simply rebooting the VMs they were on (what the infrastructure team told us to do to recover from the read-only filesystem problem). One of the VMs impacted by the read-only filesystem was a consul server node. After restarting that VM, the consul server appeared to refuse to join the cluster until all of the clients were functioning properly.

I'm also curious as to what node reported a failure. How many clients were down when you tried to start the cluster and if there were any down can you get any relevant logs from their consul agents?

Though I cannot track down the exact client that failed from the log snippet above, I do see the following errors occurring frequently on a number of the affected clients:

2016/06/24 13:26:21 [INFO] consul: adding server etcd-pz2-0 (Addr: --deleted--:8300) (DC: dc1)
2016/06/24 13:26:29 [ERR] serf: Failed to write keyring file: Failed to write keyring file: open /var/vcap/store/consul_agent/serf/local.keyring: read-only file system
2016/06/24 13:26:29 [ERR] serf: Failed to write keyring file: Failed to write keyring file: open /var/vcap/store/consul_agent/serf/local.keyring: read-only file system
2016/06/24 13:26:30 [ERR] serf: Failed to update snapshot: write /var/vcap/store/consul_agent/serf/local.snapshot: read-only file system

With that data the problem seems pretty apparent to me: the server was attempting to distribute keys to the cluster, and some of the clients, though running and connected to the cluster, couldn't write the key to their filesystems and so reported failure, which in turn prevented the server from starting.
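The all-or-nothing behavior described above can be sketched roughly as follows. This is a hypothetical illustration, not confab's actual code: the names `Member`, `write_keyring`, and `install_key` are invented for the example, but the abort-on-any-failure logic and the error message mirror the `1/108 nodes reported failure` line in the logs.

```python
# Hypothetical sketch (not confab's actual code) of the all-or-nothing
# keyring install described above: the server's start is aborted if even
# one agent fails to persist the new gossip encryption key.

class Member:
    def __init__(self, name, read_only=False):
        self.name = name
        self.read_only = read_only

    def write_keyring(self, key):
        # A node whose /var/vcap/store is read-only cannot persist
        # local.keyring, so its install-key response is a failure.
        return not self.read_only


def install_key(members, key):
    """Ask every member to install `key`; raise if any node cannot."""
    failures = [m.name for m in members if not m.write_keyring(key)]
    if failures:
        # Mirrors the log line "1/108 nodes reported failure".
        raise RuntimeError(
            f"{len(failures)}/{len(members)} nodes reported failure"
        )


# One read-only node among 108 is enough to abort the server start.
cluster = [Member(f"node-{i}") for i in range(107)]
cluster.append(Member("node-ro", read_only=True))
try:
    install_key(cluster, "--deleted--")
except RuntimeError as e:
    print(e)  # 1/108 nodes reported failure
```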

The question is: is it possible for the server to still join the cluster even though it may fail to write its key to all the clients? We eventually restarted those clients and everything worked again. It just surprised me that I couldn't start the server until all those clients were fully functional.

christianang commented 8 years ago

@youngm:

Your consul deployment should fail if any client returns an error, because that means something is wrong with your deployment, and we don't want to ignore it. In your case, if we didn't report that a client returned an error, it could be difficult to later work out why your deployment isn't working properly. Ignoring the error might also leave that client partitioned from the rest of the cluster, which could cause problems.

Christian Ang @kkallday @valeriap

youngm commented 8 years ago

@kkallday @valeriap Sounds good. I know very little about consul and will probably never have an issue like this again. But the behavior seemed odd to me, so I thought I'd raise it. If this behavior is by design then I don't have a problem with it. Thanks.

christianang commented 8 years ago

Yeah, it does seem a little strange at first glance, and we did have to have some discussion to arrive at our conclusion, but we feel it makes sense. Thanks. I'll be closing this issue now; feel free to open another issue if needed.

dlapiduz commented 8 years ago

@christianang one question though, how do you know which node is failing?

Thanks!

christianang commented 8 years ago

@dlapiduz unfortunately consul doesn't provide a good way to tell which node caused the failure from the server's logs. The only option is to go through the client logs and look for errors that occurred around the failure.
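In practice that means grepping each client VM's agent logs for `[ERR]` lines, as youngm did above. A minimal sketch follows; the real file on a BOSH VM is `/var/vcap/sys/log/consul_agent/consul_agent.stdout.log` (the path shown earlier in this thread), but here the command runs against an inline sample so it is self-contained.

```shell
# Scan agent log output for error lines; on a real client VM, replace the
# here-document with /var/vcap/sys/log/consul_agent/consul_agent.stdout.log
# (and .stderr.log). Sample lines are taken from this thread.
grep '\[ERR\]' <<'EOF'
2016/06/24 13:26:21 [INFO] consul: adding server etcd-pz2-0 (Addr: --deleted--:8300) (DC: dc1)
2016/06/24 13:26:29 [ERR] serf: Failed to write keyring file: read-only file system
EOF
```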