cloudfoundry-attic / etcd-release


Alternative solution to serial etcd startup #20

Closed · alexjh closed this issue 8 years ago

alexjh commented 8 years ago

Filing an issue here for discussion, rather than letting the conversation get lost on Slack.

In our configuration, etcd containers start simultaneously, but this hits a race condition in https://github.com/cloudfoundry-incubator/etcd-release/blob/master/jobs/etcd/templates/etcd_ctl.erb#L62: every process starts in the 'new' cluster state, and they never form a single cluster.
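For context, the startup logic in that template boils down to roughly the following (a simplified sketch, not the verbatim template; the variable names follow the snippet quoted later in this thread):

# Ask any existing members who is in the cluster.
prior_member_list=$(etcdctl member list 2>/dev/null)

if [ -z "${prior_member_list}" ]; then
  # Nobody answered: bootstrap a brand new cluster.
  cluster_state=new
  cluster="${this_node}"
else
  # Members exist: join the running cluster.
  cluster_state=existing
fi

When all instances run this at the same time, each one sees an empty member list, takes the first branch, and bootstraps its own single-node 'new' cluster.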

This is handled in the BOSH release by making the instances run serially: https://github.com/cloudfoundry-incubator/etcd-release/blob/master/manifests/aws/multi-az-ssl.yml#L102
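The manifest knob involved is BOSH's update block; a minimal sketch of the relevant stanza (exact values in the release manifest may differ):

update:
  canaries: 1
  max_in_flight: 1  # roll etcd instances one at a time
  serial: true      # update jobs one after another rather than in parallel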

I'd like to suggest an alternative: nominate a single node as the bootstrap node (e.g. via a properties.etcd.bootstrap_node property). If defined, that node would always be the etcd process that starts with cluster_state=new, and the other nodes would poll the member list until it is up.
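As a rough sketch of what I mean (illustrative only; the property lookup and the polling loop are hypothetical, not code from the release):

<% if p('etcd.bootstrap_node', '') == "#{spec.name}/#{spec.index}" %>
# This instance was nominated: create the cluster.
cluster_state=new
cluster="${this_node}"
<% else %>
# Everyone else waits until the bootstrap node is answering.
until prior_member_list=$(etcdctl member list 2>/dev/null) && [ -n "${prior_member_list}" ]; do
  sleep 5
done
cluster_state=existing
<% end %>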

Any input is appreciated, as I don't have much experience with how nodes and indexes are handled in multi-AZ setups.

cf-gitbot commented 8 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/119561927

The labels on this github issue will be updated when the story is started.

Amit-PivotalLabs commented 8 years ago

BOSH already natively supports "bootstrap" semantics that we could leverage: https://github.com/cloudfoundry/bosh-notes/blob/7f7fff5d58f4cdde79f7c2c63506d51824bdc151/finished/availability-zones.md#cli-changes--job-instance-indexing. Let me give it some thought and get back to you.
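As I read the linked notes, one instance in the group sees spec.bootstrap == true in its job templates, so ERB can branch on it; a trivial sketch:

<% if spec.bootstrap %>
# true on one instance of the group, false on the rest
echo "this node was chosen as the bootstrap node"
<% end %>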

Amit-PivotalLabs commented 8 years ago

@alexjh

This approach doesn't handle the case where the bootstrap node is rolled after the cluster is up: it would start back up in bootstrap mode rather than rejoining the rest of the cluster, which is undesirable (and will lead to split-brain).

You also generally don't want everything to start in parallel (by setting max-in-flight to the number of instances), because then you will have downtime whenever you update the cluster.

alexjh commented 8 years ago

@Amit-PivotalLabs, thanks for pointing out the flaw with rolling the bootstrap node.

Do you think it's reasonable to ensure that non-bootstrap nodes can never start etcd in a new state? What do you think of wrapping the 'new' state in a check for the bootstrap node, so that if the bootstrap node is rebooted it keeps the existing behaviour of checking the current member list?

https://github.com/cloudfoundry-incubator/etcd-release/compare/master...alexjh:new-state-only-on-bootstrap-nodes

I agree 100% about doing rolling updates to a cluster; our setup does rolling updates, but the initial startup is in parallel.

Amit-PivotalLabs commented 8 years ago

I think what you suggested sounds reasonable:

Do you think it's reasonable to ensure that non-bootstrap nodes can never start etcd in a new state?

The code in the fork seems too complex, though. etcd-release is hard to maintain because there's so much untested bash + ERB logic orchestrating a Raft cluster. We have plans to improve this by replacing the untestable code with tested Go code, and we'd love to get to that epic of work sooner. If you or your organization is interested in accelerating improvements to etcd-release and otherwise contributing to the CF Infrastructure team, I'll shamelessly plug the Cloud Foundry Foundation dojo program here ;).

Getting back to your suggestion: interleaving the ERB ifs with the bash ifs is going to be tough, and the polling logic seems unnecessary. I would be okay with it doing exactly what you said, i.e. if a node is about to start in a new state but it's not the bootstrap node, don't start. It should fail, and let monit/BOSH worry about retrying it. Something like:

if [ -z "${prior_member_list}" ]; then
  <% if spec.bootstrap == false %>
  # A non-bootstrap node must never create a brand new cluster;
  # fail fast and let monit/BOSH retry until the cluster exists.
  echo "FATAL: This is not the bootstrap node, prior members of the cluster should exist but none were found"
  exit 1
  <% end %>

  cluster_state=new
  cluster="${this_node}"
else
  # ... join the existing cluster as before ...

Would that change help?

Amit-PivotalLabs commented 8 years ago

I'm going to close out this thread for now; feel free to let me know if you want to discuss further.