ceph / ceph-cookbook

Chef cookbooks for Ceph
Apache License 2.0

bootstrap of 3rd, 4th monitor nodes hangs #202

Closed. djdefi closed this issue 5 years ago.

djdefi commented 9 years ago

Getting stuck when setting up 3rd and 4th monitor nodes. The chef-client run is hanging here:

ruby_block[get osd-bootstrap keyring] action run

This appears to be: https://github.com/ceph/ceph-cookbook/blob/master/recipes/mon.rb#L120

hufman commented 9 years ago

Can you show me what your ceph -s looks like? Does the command ceph auth get-key client.bootstrap-osd return any data or just hang?

djdefi commented 9 years ago
# ceph -s
2015-05-27 21:25:46.332544 7f25cff65700 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2015-05-27 21:25:46.332552 7f25cff65700  0 librados: client.admin authentication error (95) Operation not supported
Error connecting to cluster: Error

And

# ceph auth get-key client.bootstrap-osd
2015-05-27 21:26:00.737448 7f8862451700 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2015-05-27 21:26:00.737488 7f8862451700  0 librados: client.admin authentication error (95) Operation not supported
Error connecting to cluster: Error

I am not using encrypted databags, if that helps any.

djdefi commented 9 years ago

Also, the following is from the ceph monitor log on one of the two "working" hosts:

==> /var/log/ceph/ceph-mon.ceph-dev-mon02.log <==
2015-05-27 21:31:29.129900 7fbc2b6e8700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2015-05-27 21:31:31.127952 7fbc2b6e8700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2015-05-27 21:31:33.128204 7fbc2b6e8700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2015-05-27 21:31:35.127758 7fbc2b6e8700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2015-05-27 21:31:37.128053 7fbc2b6e8700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2015-05-27 21:31:39.130360 7fbc2b6e8700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2015-05-27 21:31:41.129810 7fbc2b6e8700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2015-05-27 21:31:43.128503 7fbc2b6e8700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2015-05-27 21:31:45.129840 7fbc2b6e8700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2015-05-27 21:31:45.557156 7fbc2dbf8700  1 mon.ceph-dev-mon02@0(leader).auth v22 client did not provide supported auth type
hufman commented 9 years ago

Hmmm I've forgotten how the client.admin keyring gets created. Can you copy an existing /etc/ceph/ceph.client.admin.keyring file to your 3rd mon and see if that changes anything? We might have to explicitly add the client.admin keyring creation if it does help.
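If copying the file does help, here is a rough, untested sketch of what explicitly creating the client.admin keyring in the recipe could look like. Note that `admin_secret` is a placeholder helper (it doesn't exist in the cookbook today) for however we would look up the admin key, e.g. from an attribute or a data bag:

```ruby
# Sketch only: materialise /etc/ceph/ceph.client.admin.keyring on a new mon.
# `admin_secret` is a hypothetical helper that would return the cluster's
# client.admin key (e.g. read from an attribute or data bag).
admin_keyring = '/etc/ceph/ceph.client.admin.keyring'

execute 'create client.admin keyring' do
  command lazy {
    "ceph-authtool '#{admin_keyring}' --create-keyring --name=client.admin " \
    "--add-key='#{admin_secret}' --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'"
  }
  creates admin_keyring
  only_if { admin_secret }
  sensitive true if Chef::Resource::Execute.method_defined? :sensitive
end
```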

nickvanw commented 8 years ago

I can confirm that I'm seeing this issue as well - after copying the ceph.client.admin.keyring file the machine continues converging without issue.

krenakzsolt commented 8 years ago

Hi All!

I'm seeing this problem too. Has anyone come up with a solution?

EDIT:

Okay, I think I found the problem. The ceph.client.admin.keyring is generated automatically if the monitor node successfully joins a cluster or starts one. The root of the problem was that after the second node was deployed, every other node failed to join the existing cluster because of an auth error. The auth error happened because the 3rd node generated its own monitor_secret instead of using the existing one:

```ruby
execute 'format mon-secret as keyring' do # ~FC009
  command lazy { "ceph-authtool '#{keyring}' --create-keyring --name=mon. --add-key='#{mon_secret}' --cap mon 'allow *'" }
  creates keyring
  only_if { mon_secret } # <------ This was nil, so this step was skipped
  sensitive true if Chef::Resource::Execute.method_defined? :sensitive
end
```

And this step executed instead:

```ruby
execute 'generate mon-secret as keyring' do # ~FC009
  command "ceph-authtool '#{keyring}' --create-keyring --name=mon. --gen-key --cap mon 'allow *'"
  creates keyring
  not_if { mon_secret }
  notifies :create, 'ruby_block[save mon_secret]', :immediately
  sensitive true if Chef::Resource::Execute.method_defined? :sensitive
end
```

I can only suspect why this was happening, but I think it is because of the mon_secret method:

```ruby
def mon_secret
  if node['ceph']['encrypted_data_bags']
    secret = Chef::EncryptedDataBagItem.load_secret(node['ceph']['mon']['secret_file'])
    Chef::EncryptedDataBagItem.load('ceph', 'mon', secret)['secret']
  elsif !mon_nodes.empty?
    mon_nodes[0]['ceph']['monitor-secret'] # <--- The problem is here, using the [0] element
  elsif node['ceph']['monitor-secret']
    node['ceph']['monitor-secret']
  else
    Chef::Log.info('No monitor secret found')
    nil
  end
end
```

I couldn't find out for sure, but I think the search returns the results in reverse order, so the only node storing the mon_secret is the last one. That's why the second mon node worked (there was only one element in the array) and the later ones didn't.

My solution was to save the mon_secret on every node:

```ruby
execute 'format mon-secret as keyring' do # ~FC009
  command lazy { "ceph-authtool '#{keyring}' --create-keyring --name=mon. --add-key='#{mon_secret}' --cap mon 'allow *'" }
  creates keyring
  only_if { mon_secret }
  notifies :create, 'ruby_block[save mon_secret]', :immediately # <-- added this line to save mon_secret
  sensitive true if Chef::Resource::Execute.method_defined? :sensitive
end
```
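An alternative (just an untested sketch, not something I have run) would be to make the lookup itself more defensive and take the secret from whichever searched mon node actually carries one, instead of trusting mon_nodes[0]:

```ruby
# Untested sketch: prefer this node's own saved secret, otherwise take the
# secret from whichever searched mon node actually carries one.
def mon_secret
  if node['ceph']['encrypted_data_bags']
    secret = Chef::EncryptedDataBagItem.load_secret(node['ceph']['mon']['secret_file'])
    Chef::EncryptedDataBagItem.load('ceph', 'mon', secret)['secret']
  elsif node['ceph']['monitor-secret']
    node['ceph']['monitor-secret']
  else
    holder = mon_nodes.find { |n| n['ceph'] && n['ceph']['monitor-secret'] }
    if holder
      holder['ceph']['monitor-secret']
    else
      Chef::Log.info('No monitor secret found')
      nil
    end
  end
end
```

That would avoid depending on the search order, but it still only helps once at least one mon node has actually saved its secret.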

And I would suggest checking ceph status if you previously just copied the keyring file and had a successful chef-client run. Even though the chef run succeeds, the monitor won't join the cluster, so check that you are not running with only 2 monitors.

Cheers

nickvanw commented 8 years ago

I've had success with manually copying the key, but that's obviously not a long-term solution.


krenakzsolt commented 8 years ago

Does ceph status show all the monitors in the cluster? For me it only showed the first two, even after I copied the key and chef-client ran successfully, and I think that is to be expected. Also check the monitor nodes' attributes with knife node show nodename -m. The monitor_secret attribute should be the same on the first two nodes, but differ on any other.
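If you want to compare that attribute across all mon nodes at once, something like this knife exec script should work (untested; the ceph_is_mon:true query is just my guess at how your monitor nodes are found, adjust as needed):

```ruby
# Untested sketch; save as show_mon_secrets.rb and run: knife exec show_mon_secrets.rb
# The search query is a guess - use whatever query matches your monitor nodes.
nodes.find('ceph_is_mon:true') do |n|
  secret = n['ceph'] ? n['ceph']['monitor-secret'] : nil
  puts "#{n.name}: #{secret.inspect}"
end
```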

guilhem commented 8 years ago

I may have found the problem. Working on it ;)

guilhem commented 8 years ago

I haven't tested it yet, but I just want to give you an idea of what can be done. Let me know if it works.

vishalkanaujia commented 8 years ago

I also hit this problem today in a teuthology run based on a Ceph developer build. It turned out to be a problem with old Ceph client admin keys. I had the liberty of deleting them with "rm -rf /etc/ceph/*", and it worked smoothly after that.

djdefi commented 6 years ago

I was going through my old issues and noticed this is still open. I have no plans of pursuing this further personally, as I am no longer using Ceph.

@guilhem did you ever come up with a fix, or can we close this out as stale?