We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/119561927
The labels on this github issue will be updated when the story is started.
BOSH already natively supports "bootstrap" semantics that we could leverage: https://github.com/cloudfoundry/bosh-notes/blob/7f7fff5d58f4cdde79f7c2c63506d51824bdc151/finished/availability-zones.md#cli-changes--job-instance-indexing. Let me give it some thought and get back to you.
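For reference, the primitive that note describes is the `spec.bootstrap` flag that BOSH exposes to job templates; it evaluates to true on exactly one instance of the instance group. A minimal illustration of how a ctl script could use it (this is a sketch, not the actual etcd_ctl.erb logic):

```
# Minimal sketch only, not the actual etcd_ctl.erb: BOSH renders spec.bootstrap
# as true on exactly one instance of the group, so only that node would ever
# create a brand-new cluster.
<% if spec.bootstrap %>
  cluster_state=new
<% else %>
  cluster_state=existing
<% end %>
```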
@alexjh
This approach doesn't handle the case where the bootstrap node is rolled after the cluster is up: it will start back up in bootstrap mode rather than joining the rest of the cluster, which is undesirable (and will lead to split-brain).
You also generally don't want all the instances to start in parallel (by setting max-in-flight to the number of instances), because then you'll have downtime whenever you update the cluster.
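For reference, the way to avoid that downtime is the usual conservative update block in the deployment manifest; the values below are only illustrative, not the exact ones etcd-release's manifests ship with:

```yaml
# Illustrative BOSH manifest update block (not the exact values from
# manifests/aws/multi-az-ssl.yml): rolling one instance at a time keeps
# the etcd cluster at quorum throughout an update.
update:
  canaries: 1
  canary_watch_time: 1000-180000
  update_watch_time: 1000-180000
  max_in_flight: 1
  serial: true
```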
@Amit-PivotalLabs, thanks for pointing out the flaw with rolling the bootstrap node.
Do you think it's reasonable to ensure that non-bootstrap nodes can never start etcd in the `new` state? What do you think of wrapping the `new` state in a check for the bootstrap node, so that a bootstrap node that is merely rebooted keeps the existing behaviour of checking the current member list?
I agree 100% about doing rolling updates to a cluster; our setup does rolling updates, but the initial startup is in parallel.
I think what you suggested sounds reasonable:

> Do you think it's reasonable to ensure that non-bootstrap nodes can never start etcd in a `new` state?
The code in the fork seems too complex, though. etcd-release is hard to maintain because there's so much untested bash + ERB logic orchestrating a Raft cluster. We have plans to improve this by replacing the untestable code with tested Go code, but we'd love to be able to get to that epic of work sooner. If you or your organization is interested in accelerating improvements to etcd-release and otherwise contributing to the CF Infrastructure team, I'll shamelessly plug the Cloud Foundry Foundation dojo program here ;).
Getting back to your suggestion: interleaving the ERB `if`s with the bash `if`s is going to be tough, plus the polling logic seems unnecessary. I would be okay with it doing exactly what you said, i.e. if the process is about to start in the `new` state but this is not the bootstrap node, don't start. It should fail, and let monit/BOSH worry about retrying it. Something like:
if [ -z "${prior_member_list}" ]; then
  <% if spec.bootstrap == false %>
  echo "FATAL: This is not the bootstrap node, prior members of the cluster should exist but none were found"
  exit 1
  <% end %>
  cluster_state=new
  cluster="${this_node}"
else
Would that change help?
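For completeness, the surrounding block might then read roughly like the sketch below; the `else` branch is only a stand-in for the template's existing join-the-cluster logic, which I haven't reproduced exactly:

```
if [ -z "${prior_member_list}" ]; then
  <% if spec.bootstrap == false %>
  echo "FATAL: This is not the bootstrap node, prior members of the cluster should exist but none were found"
  exit 1
  <% end %>
  cluster_state=new
  cluster="${this_node}"
else
  # stand-in for the existing behaviour: join the members that were discovered
  cluster_state=existing
  cluster="${prior_member_list}"
fi
```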
I'm going to close out this thread for now; feel free to let me know if you want to discuss further.
Filing an issue here for discussion rather than letting the conversation go missing on Slack.
In our configuration, the etcd containers start simultaneously, and this hits a race condition in https://github.com/cloudfoundry-incubator/etcd-release/blob/master/jobs/etcd/templates/etcd_ctl.erb#L62 where all of the processes start in the `new` state and never form a cluster.

This is handled in the BOSH release by making the instances start serially: https://github.com/cloudfoundry-incubator/etcd-release/blob/master/manifests/aws/multi-az-ssl.yml#L102
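To make the failure mode concrete: as far as I can tell, `cluster_state` ends up as etcd's --initial-cluster-state flag, so when every instance computes `new` at the same time, each one bootstraps its own single-member cluster instead of joining the others. Roughly:

```
# Simplified sketch of the resulting etcd invocation, not the exact command line
# that etcd_ctl.erb builds. With cluster_state=new on every node, each process
# starts its own one-member cluster and they never converge.
etcd --name "${this_node}" \
     --initial-cluster "${cluster}" \
     --initial-cluster-state "${cluster_state}"
```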
I'd like to suggest an alternative: nominate a single node as the bootstrap node (e.g. via `properties.etcd.bootstrap_node`). If defined, that node would always be the etcd process that starts with `cluster_state=new`, and the other nodes would poll the member list until it is up. Any input is appreciated, as I don't have much experience in how nodes and indexes are handled in multi-AZ setups.
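A rough sketch of what I have in mind; `etcd.bootstrap_node`, the polling loop, and the address/port are all illustrative rather than existing etcd-release properties:

```
# Illustrative only: etcd.bootstrap_node is a hypothetical property, and the
# address/port used for polling are placeholders.
bootstrap_node="<%= p('etcd.bootstrap_node') %>"

if [ "${this_node}" = "${bootstrap_node}" ]; then
  # Only the nominated node is ever allowed to create a new cluster.
  cluster_state=new
  cluster="${this_node}"
else
  # Everyone else waits until the bootstrap node answers, then joins as existing.
  until etcdctl --peers "https://${bootstrap_node}:4001" member list > /dev/null 2>&1; do
    sleep 2
  done
  cluster_state=existing
fi
```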