bloomberg / chef-bach

Chef recipes for Bloomberg's deployment of Hadoop and related components
Apache License 2.0

cluster-assign-roles fails with Net::SSH::HostKeyMismatch when rebuilding cluster #1280

Open cbaenziger opened 6 years ago

cbaenziger commented 6 years ago

When deleting a cluster and reusing a bootstrap node, one will get an error like the following:

Refreshed chef-vault item ssh_host_keys/iptables-bcpc-vm1.bcpc.example.com
Refreshed chef-vault item ssh_host_keys/iptables-bcpc-vm2.bcpc.example.com
iptables-bcpc-vm2.bcpc.example.com: Cheffing with runlist 'role[Basic],recipe[bcpc::default],recipe[bcpc::networking]'
bundler: failed to load command: ./cluster_assign_roles.rb (./cluster_assign_roles.rb)
Net::SSH::HostKeyMismatch: fingerprint d0:bb:35:f8:c5:15:bb:dc:89:40:67:9c:d9:71:a1:02 does not match for "10.0.109.12"
  /home/vagrant/chef-bcpc/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/verifiers/secure.rb:48:in `process_cache_miss'
  /home/vagrant/chef-bcpc/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/verifiers/secure.rb:33:in `verify'
  /home/vagrant/chef-bcpc/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/verifiers/strict.rb:16:in `verify'
  /home/vagrant/chef-bcpc/ruby/2.4.0/gems/net-ssh-4.2.0/lib/net/ssh/verifiers/lenient.rb:15:in `verify'

Unexpectedly, the error comes back even after ~/.ssh/known_hosts is deleted. This is because the SSH host keys are stored in Knife vault and are written back onto the hosts[1], so the keys end up different from the ones generated by the OS install. We should set up cluster-assign-roles either to pre-load the correct host keys into the SSH known_hosts file or to ignore the mismatch when rebuilding a machine (maybe with a rebuild/ignore-known-hosts flag?).
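
A minimal sketch of the "pre-load" option, assuming the expected public host key has already been obtained somehow (how to read it out of the vault item is not shown here and depends on the item layout). The function name `preload_known_host` and its arguments are hypothetical, and it only handles plain (unhashed) known_hosts entries:

```shell
#!/usr/bin/env bash
# Hypothetical helper: make known_hosts hold exactly one entry for a host.
# Arguments: host, key type, base64 public key, optional known_hosts path.
# Assumption: entries are unhashed and the first field is exactly the host.
preload_known_host() {
  local host="$1" key_type="$2" key_b64="$3"
  local known_hosts="${4:-$HOME/.ssh/known_hosts}"
  local tmp
  tmp="$(mktemp)" || return 1
  # Keep every existing line whose first field is not this host
  # (the appended "" forces a string comparison in awk).
  if [ -f "$known_hosts" ]; then
    awk -v h="$host" '($1 "") != (h "")' "$known_hosts" > "$tmp"
  fi
  # Append the key the rebuilt host is expected to present.
  printf '%s %s %s\n' "$host" "$key_type" "$key_b64" >> "$tmp"
  mv "$tmp" "$known_hosts"
}
```

Calling it before the SSH connection, for example `preload_known_host 10.0.109.12 ssh-rsa "$expected_key"`, would replace a stale entry instead of leaving a mismatching one behind.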

Outside a production environment, one can also run the following so that the new SSH host key is not overwritten:

knife data bag delete -y ssh_host_keys <fqdn>
knife data bag delete -y ssh_host_keys <fqdn>_keys
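
When several VMs are rebuilt at once, the two deletions above could be wrapped in a loop. This is only a sketch: the function name `purge_ssh_host_key_items` and the `KNIFE` override variable are hypothetical, and the item names simply follow the `<fqdn>` and `<fqdn>_keys` pattern shown above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the two knife commands above.
# KNIFE may be overridden (e.g. KNIFE=echo) to print the arguments
# instead of running knife, as a dry run.
purge_ssh_host_key_items() {
  local knife="${KNIFE:-knife}" fqdn
  for fqdn in "$@"; do
    # Same two deletions as above, once per host being rebuilt.
    "$knife" data bag delete -y ssh_host_keys "$fqdn"
    "$knife" data bag delete -y ssh_host_keys "${fqdn}_keys"
  done
}
```

A dry run such as `KNIFE=echo purge_ssh_host_key_items iptables-bcpc-vm1.bcpc.example.com` prints the arguments that would be passed to knife without touching the data bag.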

aespinosa commented 6 years ago

The repxe-host.sh script orchestrates all of this. If you are deleting a cluster, you need to do everything that repxe-host.sh does to get it properly cleaned.

cbaenziger commented 6 years ago

Ah, this is a VM-specific issue for testing. I cannot see how to apply repxe-host.sh to that yet...

aespinosa commented 6 years ago

I'd argue that would then be in scope for something between tests/automated_install.sh and the layer right above cluster-assign-roles.sh, not cluster-assign-roles.sh itself.

cbaenziger commented 6 years ago

@aespinosa Ah, good idea; yes, my hang-up was that this broke the idempotency of tests/automated_install.sh, so it may be possible to have that script do the necessary work rather than cluster_assign_roles.rb.