Closed aktarali closed 8 years ago
@aktarali generally speaking, it’s up to you as the operator to create that file with the appropriate value inside. There are a few ways to handle that:
- Create the myid file yourself at initial bootstrap
- Use exhibitor to manage myid for you
- Use the exhibitor cookbook, which actually wraps this cookbook, to manage your ZK cluster
We use Exhibitor at EverTrue, and I believe the Simple folks use it too, since they initially wrote the exhibitor cookbook.
Guys;
@jeffbyrnes @asenchi @daveyeu @3n @solarce The method used by EverTrue and SimpleFinance does not seem truly dynamic. For that reason, I have been playing with using Chef CMDB (which is its actual reason for existing: to allow storage of node attributes) to grab the values necessary for both zoo.cfg and myid.
My Ruby and DSL skills are still limited, which is why I have opted to use Bash. Is there any way you guys would consider adding this functionality to the master branch? This method is working so well that I think it's worth considering. Here is what I've added from all your help:
Created a wrapper cookbook:
Attributes:
# set up zk attributes
default[:zookeeper][:config] = {
  clientPort: 2181,
  dataDir: '/var/lib/zookeeper',
  tickTime: 2000,
  initLimit: 11,
  syncLimit: 17
}
# recipe: zk.rb
i = 0
search(
  :node,
  "roles:zookeeper AND chef_environment:#{node.chef_environment}"
).sort.map do |n|
  i += 1
  node.set['zookeeper']['config']["server.#{i}"] = "#{n['fqdn']}:2888:3888"
end
bash 'add_myid' do
  code <<-EOH
    cat `find / \! -type l -name zoo.cfg`|grep `hostname`|cut -d"=" -f1|cut -d"." -f2 > /var/lib/zookeeper/myid
    chown -R zookeeper:zookeeper /var/lib/zookeeper
    rm -rf /etc/zookeeper/conf/zoo.cfg; ln -s /opt/zookeeper/zookeeper-3.4.6/conf/zoo.cfg /etc/zookeeper/conf/zoo.cfg
    zookeeper-server-initialize --myid=`cat /var/lib/zookeeper/myid` --force
  EOH
end
I could have added this to the same zk.rb file, but I put the ZK service restart in a separate file. It could have been handled differently, I think, and I'm sure you clever guys can come up with a better method.
zk_service:
bash 'restart_zookeeper' do
  code <<-EOH
    chown -R zookeeper:zookeeper /var/lib/zookeeper
    service zookeeper stop
    sleep 3
    service zookeeper start
  EOH
end
These modifications allow the zoo.cfg values to be identical on all nodes in the cluster, and allow myid to be grabbed from zoo.cfg and the service reloaded. This now works in a fully dynamic way without any external S3 configs, manual process, or data bag.
Your contributions and thoughts on this are welcome.
Ali.
@aktarali hope you don’t mind, I added some code block formatting to make that a bit easier to read.
Reworked in a more Chef-like fashion:
# attributes
set[:zookeeper][:config][:initLimit] = 11
set[:zookeeper][:config][:syncLimit] = 17
# recipe
i = 0
search(
  :node,
  "roles:zookeeper AND chef_environment:#{node.chef_environment}"
).sort.map do |n|
  i += 1
  node.set['zookeeper']['config']["server.#{i}"] = "#{n['fqdn']}:2888:3888"
end
include_recipe 'zookeeper::default'
file '/var/lib/zookeeper/myid' do
  content node['zookeeper']['config'].key("#{node['fqdn']}:2888:3888").sub 'server.', ''
  owner 'zookeeper'
  group 'zookeeper'
end
include_recipe 'zookeeper::service'
For the attributes, in your wrapper, you’ll want to use set (aka normal), not default precedence level, so you properly override the defaults in this cookbook. Additionally, you only need to set the ones that are different; otherwise, leave the defaults alone.
Then we find our ZK nodes & set the necessary attributes for the ZK config.
Then we install & configure ZK using the supplied zookeeper::default recipe.
Once we’ve got all the necessary files & directories, we create our myid file, searching the attributes to find this particular server’s ID. In this way, we ensure we use the correct ID for this particular server.
Rather than fight with the cookbook on where the config file lives with this symlinking dance, best to just use the supplied zoo.cfg, as it will be configured the way you want it to be, so we can omit all of that.
Finally, we include the service recipe, which actually sets up & starts the service itself, starting ZooKeeper.
There’s no need to restart ZK in your recipe (nor would you want to in a live environment, owing to the potential for data loss), since you don’t start the service until after you’ve configured things. If you added a new ZK node, or removed one, you’d want to reconverge the existing nodes, then manually restart them in a rolling fashion to pick up the changes.
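To make the ID lookup in the recipe concrete, here is a minimal plain-Ruby sketch of the same idea that runs outside Chef. The helper names server_entries and myid_for and the example FQDNs are assumptions made up for illustration; they are not part of the zookeeper cookbook.

```ruby
# Plain-Ruby sketch of the "server.N" numbering and myid lookup.
# server_entries and myid_for are hypothetical helpers; the FQDNs are made up.

# Build "server.N" => "fqdn:2888:3888" entries from a list of FQDNs.
# Sorting first gives every node the same numbering for the same membership.
def server_entries(fqdns)
  fqdns.sort.each_with_index.map do |fqdn, index|
    ["server.#{index + 1}", "#{fqdn}:2888:3888"]
  end.to_h
end

# Look up this host's entry by value and strip the "server." prefix.
# Returns nil when the host is not in the list.
def myid_for(entries, fqdn)
  key = entries.key("#{fqdn}:2888:3888")
  key && key.sub('server.', '')
end

entries = server_entries(%w[zk-b.example.internal zk-a.example.internal zk-c.example.internal])
puts myid_for(entries, 'zk-b.example.internal') # prints 2
```

In this sketch the numbering depends on the sorted membership list, so a new host that sorts ahead of existing ones would shift their IDs. That is worth keeping in mind if the search results can change over time.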
@aktarali I’d be inclined to agree w/ @williamsjj, using Exhibitor is probably the strongest solution; I can say definitively that it’s given us a rock solid ZK infrastructure.
@aktarali @jeffbyrnes It already takes ZooKeeper a fairly long time to settle down when nodes join and leave. Adding 15-30 minutes of latency to that due to Chef runs would make the interval really painful.
@williamsjj oh man, so true. Even with Exhibitor, it can take a little while to calm down after a change.
Hi Guys;
Completely agree with all comments bar one. What I've tried doing is removing the need to have any dependency on another tool/filestore/file external to chef. Some scenarios might explain my rationale:
1: When you first bring up a cluster (3 nodes, say), the timing depends on how often you run chef-client (I run it every 15 min). After the initial bootstrap, I expect my cluster to be fully up and running within 15 min.
2: Any new masters joining will join the cluster with the existing configs from chef-server. After the next run in 15 min, the other 3 will have the new node's config; after another 15 min, the new node will get its config from chef-server and complete its zoo.cfg. Since 3, 5, or 7 nodes is recommended for a good cluster setup, the quorum will only need changing when your server requirements reach 5 or 7.
I know the time it takes for the initial cluster to come up could be a lot quicker with a filestore. Subsequent additions to the cluster may take longer. However, if rolling in the new hosts to join an existing cluster is not time constrained, then I'm happy with it taking its time. In this manner I have eliminated another dependency.
Really appreciate your help in this regard and all the feedback you guys have been providing. Really awesome ppl to work with.
Cheers Ali.
Ali Aktar Zircon Solutions LTD
Hi Jeff;
I've set the code as per your instructions:
Attributes:
# set up mesos attribute
default[:zookeeper][:config] = {
  clientPort: 2181,
  dataDir: '/var/lib/zookeeper',
  tickTime: 2000
}
#default['zookeeper']['config']['server.1'] = 'ip-172-31-40-37.eu-west-1.compute.internal:2888:3888'
set[:zookeeper][:config][:initLimit] = 11
set[:zookeeper][:config][:syncLimit] = 17
Recipe:
i = 0
search(
  :node,
  "roles:zookeeper AND chef_environment:#{node.chef_environment}"
).sort.map do |n|
  i += 1
  node.set['zookeeper']['config']["server.#{i}"] = "#{n['fqdn']}:2888:3888"
end
include_recipe 'zookeeper::default'
file '/var/lib/zookeeper/myid' do
  content node['zookeeper']['config'].key("#{node['fqdn']}:2888:3888").sub 'server.', ''
  owner 'zookeeper'
  group 'zookeeper'
end
include_recipe 'zookeeper::service'
Error: When running chef-client, I'm getting an error:
================================================================================
Recipe Compile Error in /var/chef/cache/cookbooks/wrapper/recipes/zk.rb
================================================================================
NoMethodError
-------------
undefined method `sub' for nil:NilClass
Cookbook Trace:
---------------
/var/chef/cache/cookbooks/wrapper/recipes/zk.rb:38:in `block in from_file'
/var/chef/cache/cookbooks/wrapper/recipes/zk.rb:37:in `from_file'
Relevant File Content:
----------------------
/var/chef/cache/cookbooks/wrapper/recipes/zk.rb:
31: i += 1
32: node.set['zookeeper']['config']["server.#{i}"] = "#{n['fqdn']}:2888:3888"
33: end
34:
35: include_recipe 'zookeeper::default'
36:
37: file '/var/lib/zookeeper/myid' do
38>> content node.set['zookeeper']['config'].key("#{node['fqdn']}:2888:3888").sub 'server.', ''
39: owner 'zookeeper'
40: group 'zookeeper'
41: end
42:
43: include_recipe 'zookeeper::service'
44:
Running handlers:
Running handlers complete
Chef Client failed. 0 resources updated in 04 seconds
Can you assist?
Cheers
Ali.
@aktarali this is straightforward: the value of node['zookeeper']['config'].key("#{node['fqdn']}:2888:3888") is nil, and nil has no sub method.
In other words, there isn’t a key in the node['zookeeper']['config'] attribute with the value "#{node['fqdn']}:2888:3888". This might be a chicken & egg problem on first convergence of the first node of the cluster. A node’s Chef attributes aren’t written to the Chef Server until after first convergence completes, and so, the first node won’t find any nodes with that search, and thus, won’t populate node['zookeeper']['config'] with any servers (i.e., node['zookeeper']['config']['server.1'] doesn’t exist, nor any other numbered servers).
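The failure can be reproduced in plain Ruby, which may make the nil case easier to see. This is only a sketch: the config hash stands in for node['zookeeper']['config'] on a first converge, the FQDN is made up, and myid_from is a hypothetical helper showing one possible guard rather than anything the cookbook provides.

```ruby
# Stand-in for node['zookeeper']['config'] when the search returned no nodes,
# so no "server.N" entries were added. The FQDN is a made-up example.
config = { 'clientPort' => 2181, 'tickTime' => 2000 }
wanted = 'zk-a.example.internal:2888:3888'

# Hash#key returns nil when no entry has the given value.
key = config.key(wanted)
puts key.inspect
# key.sub('server.', '') would raise NoMethodError here, as in the trace.

# One possible guard (hypothetical helper): fail with a clearer message
# instead of calling sub on nil.
def myid_from(config, value)
  key = config.key(value)
  raise ArgumentError, "no server entry for #{value}" if key.nil?
  key.sub('server.', '')
end

# Once the entry exists, the lookup succeeds.
puts myid_from(config.merge('server.1' => wanted), wanted)
```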
To your earlier point, how do you plan to handle when a ZooKeeper node goes down? As in, the underlying hardware (whether that’s bare steel or a virtual machine) fails, and now you need to quickly re-provision & add a new node?
@JeffByrnes I've got past that problem by reordering the run order. The other problem I now have is that after the myid file is updated, the zookeeper process is not restarting, and as a result the node is not picking up its new myid. Any thoughts?
Ali.
Ali Aktar Zircon Solutions LTD
@aktarali I'm not sure you should change the myid for an existing node. That sounds bad as I understand it.
@neil-greenwood For an initial cluster setup, automating the zookeeper install and allocating its correct MYID on the fly is a little cumbersome. Once MYID is set (from the values in zoo.cfg), the zookeeper service has to be restarted, using a Chef resource "notifies" so that the service restarts if the file has changed. When zookeeper is first provisioned, the myid file does not exist, which renders the Mesos cluster unusable. Myid has to be set somehow and the service reloaded. I have not re-allocated MYID from one node to another. It's a fair point; I wonder what would happen? If all nodes are sync'd, would it matter?
I can't wait for Mesos to adopt Consul. ZK is a bad choice for Mesos, I believe.
I think something similar to this in chef would work:
bash 'add_myid' do
  user 'zookeeper'
  group 'zookeeper'
  code <<-EOH
    host=`hostname`
    file=`find / \! -type l -name zoo.cfg`
    zoovalue=`grep $host $file|cut -d"=" -f1|cut -d"." -f2`
    if [[ $zoovalue -eq `cat /var/lib/zookeeper/myid` ]]
    then
      echo "Nothing to do as myid file value is correct as per zoo.cfg"
    else
      cat `find / \! -type l -name zoo.cfg`|grep `hostname`|cut -d"=" -f1|cut -d"." -f2 > /var/lib/zookeeper/myid
    fi
  EOH
  notifies :restart, 'service[zookeeper]', :immediately
end
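For comparison, the same zoo.cfg lookup can be sketched in plain Ruby instead of grep and cut. This is an illustration under assumptions: myid_from_zoo_cfg is a hypothetical helper, and the zoo.cfg contents and hostnames are made up. It matches the hostname exactly, whereas grep would also match any line that merely contains the hostname as a substring.

```ruby
# Hypothetical helper: find the server ID for a hostname in zoo.cfg text.
# Returns the ID as a string, or nil if the host has no server.N line.
def myid_from_zoo_cfg(cfg_text, hostname)
  cfg_text.each_line do |line|
    match = line.strip.match(/\Aserver\.(\d+)=([^:]+):/)
    return match[1] if match && match[2] == hostname
  end
  nil
end

# Made-up zoo.cfg contents for illustration.
sample_cfg = <<~CFG
  clientPort=2181
  dataDir=/var/lib/zookeeper
  server.1=zk-a.example.internal:2888:3888
  server.2=zk-b.example.internal:2888:3888
CFG

current_myid = '1' # pretend this was read from /var/lib/zookeeper/myid
expected = myid_from_zoo_cfg(sample_cfg, 'zk-b.example.internal')

if expected == current_myid
  puts 'myid already matches zoo.cfg'
else
  puts "myid is #{current_myid}, zoo.cfg says #{expected}"
end
```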
@aktarali so the myid file should be created by this resource I provided to you:
file '/var/lib/zookeeper/myid' do
  content node['zookeeper']['config'].key("#{node['fqdn']}:2888:3888").sub 'server.', ''
  owner 'zookeeper'
  group 'zookeeper'
end
However, as I said, you’re gonna run into a chicken & egg problem on first convergence. You mentioned you solved that by changing the recipe run order.
I should point out, you do not want to change a node’s myid, ever, once it’s live. So you want to start it up with the correct myid in place already. This is why I placed include_recipe 'zookeeper::service' after the file[/var/lib/zookeeper/myid] resource, so that the service would not be started until the myid was in place.
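The "never change it once it's live" rule can be sketched in plain Ruby as a write-once helper. This is a sketch under assumptions: ensure_myid is a hypothetical name, and a temporary directory stands in for /var/lib/zookeeper. In a recipe, the file resource's :create_if_missing action is one way to get similar behavior.

```ruby
require 'tmpdir'

# Hypothetical helper: write myid only when the file does not exist yet,
# so an ID that is already on disk is never rewritten.
# Returns the ID that is on disk after the call.
def ensure_myid(path, proposed_id)
  return File.read(path).strip if File.exist?(path)

  File.write(path, "#{proposed_id}\n")
  proposed_id.to_s
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'myid') # stand-in for /var/lib/zookeeper/myid
  puts ensure_myid(path, 2)     # first run writes the file; prints 2
  puts ensure_myid(path, 5)     # later run keeps the existing ID; prints 2
end
```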
I’m going to have to bow out of providing any more assistance at this point, I just don’t have any more cycles right now. My recommendation stands to use Exhibitor to take care of all of this for you. We have a system that looks like this:
Everything works brilliantly, I do not wrangle with the myid file at all, and I haven’t touched our ZK cluster in months, maybe even a year. I appreciate the desire to decrease your dependencies, but I’d encourage you to take advantage of what the ecosystem has provided. Netflix already solved this problem for us, and we have a great Exhibitor cookbook to help you do it with Chef.
@aktarali Really have to second @jeffbyrnes here. I hate having extra dependencies, but Exhibitor is actually the way to go:
That's great, guys. Really appreciate your help.
@aktarali I believe we’ve answered your questions to the best of our abilities. I’m going to close this out; please feel free to re-open or create a new issue if you have further questions.
Hi Guys;
Wanted to say you guys are doing an absolutely marvellous job. I noticed my ZK cluster and Mesos cluster do not register the slaves properly unless /var/lib/zookeeper/myid reflects the server ID as per zoo.cfg.
How come that file is not being populated, as it's so fundamental to the operation of ZooKeeper and Mesos?
Or am I missing something?
Thanks Ali.