cloudfoundry-attic / consul-release

This is a BOSH release for consul.
Apache License 2.0

When a VM with no disk space has a consul agent, the consul server fails #66

Closed jaresty closed 7 years ago

jaresty commented 7 years ago

The consul server can fail when the root cause is actually insufficient disk space on another VM

In our upgrade environment for a CF deployment, we had a failing consul VM. It was failing with logs like this:

{"timestamp":"1490896140.416868925","source":"confab","message":"confab.agent-client.set-keys.install-key.request.failed","log_level":2,"data":{"error":"Unexpected response code: 500 (1 error(s) occurred:\n\n* 2/32 nodes reported failure)","key":"/ZgtX2dr6HYLZ9OlCfMQOA=="}}

This is problematic because consul comes first in the manifest.

BOSH will happily adjust the disk size of its instance groups when asked to, but it updates them in order. Since the cause of the consul failure was actually on another VM, BOSH cannot resolve this problem on its own: it tries to update the consul VM first, since it is first in the manifest, and that update fails no matter how you adjust the disk size of later instance groups. As a consequence, with the new BOSH CLI, which has no way to force-recreate a VM from an un-deployed manifest, the only way to fix this problem is to rearrange your manifest so that the instance group that needs its disk resized comes before the consul instance group.
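To make the workaround concrete, a manifest reordering might look like this (a sketch only; the instance group names, the other group's identity, and the disk sizes are hypothetical, not taken from our actual manifest):

```yaml
# Sketch: move the group whose disk must grow ahead of consul,
# so BOSH resizes its disk before it tries to update consul.
instance_groups:
- name: some-out-of-space-group   # hypothetical; previously listed after consul
  persistent_disk_type: 100GB     # the resized disk
  # ...
- name: consul
  persistent_disk_type: 1GB
  # ...
```

This only works because BOSH processes `instance_groups` strictly in the order they appear in the manifest.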

Can we fix consul so it succeeds even if the agents on other VMs are returning 500s?

This worked for us once, but it seems like a brittle solution that is not safe to recommend in all cases. For example, there is no guarantee that all instance groups could be moved ahead of consul and result in a successful deployment. What if those jobs needed to come later in order to succeed?

cc: @anEXPer @dsabeti @mingxiao

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/142807575

The labels on this github issue will be updated when the story is started.

evanfarrar commented 7 years ago

I'm not certain we should change the behavior of the consul server here; within this frustrating error is a nugget of good behavior. You are alerted that a new key will not be installed across the cluster and that you are effectively partitioning your consul cluster if you continue. If the operator is OK with the out-of-space VM losing the ability to find peers or advertise services for discovery, then they should also be OK with that VM being destroyed or its monit processes being stopped. I think killing (or monit-stopping) the out-of-space VMs and redeploying with more disk space would have resolved the stuck situation you were in.
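A rough sketch of that workaround with the v2 `bosh` CLI (the deployment name `cf`, the instance name, and the manifest path are hypothetical; `consul_agent` is the job's monit name in this release):

```shell
# Sketch: stop the consul agent on the out-of-space VM so it stops
# reporting failures to the rest of the cluster.
bosh -d cf ssh some-out-of-space-group/0 \
  -c 'sudo /var/vcap/bosh/bin/monit stop consul_agent'

# Then redeploy with a larger persistent disk for that instance group.
bosh -d cf deploy cf.yml
```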

dsabeti commented 7 years ago

It sounds like the BOSH team is updating the recreate behavior, so that should be the way to recover from this problem.
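If that recreate change lands, recovery might look like the following (a sketch; the deployment and instance names are hypothetical):

```shell
# Sketch: recreate the stuck VM directly; --fix recovers an instance
# with an unresponsive agent, and --skip-drain avoids the failing drain.
bosh -d cf recreate consul/0 --fix --skip-drain
```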

evanfarrar commented 7 years ago

I don't think we're currently planning any action to fix this. Perhaps we could come up with a way to better explain the nature of this error by parsing consul's logs and inserting more messages about why it is stopping the deployment, but that would be a significant change to a project that we're currently trying to phase out of CF. I will add this to the failure recovery instructions, though.