CenturyLinkCloud / chef-provisioning-vsphere

A chef-provisioning provisioner for VMware vSphere
MIT License

New machines are always rebooted after provision #28

Closed (cnelson closed 8 years ago)

cnelson commented 9 years ago

Each new machine I bring up is rebooted before it's converged. This increases my runtime significantly.

I've commented out this section of code as a test, and provisioning runs fine without a reboot:

https://github.com/CenturyLinkCloud/chef-provisioning-vsphere/blob/19272e9225984a8e3abc8a0a51c978044468430a/lib/chef/provisioning/vsphere_driver/driver.rb#L320

Before I submit a PR to change this behavior, can someone educate me on why this driver reboots each machine after it comes up for the first time? Is it working around some issue in vSphere that I haven't run into?

mwrock commented 9 years ago

In most cases that should not be called. That restart occurs as a last-ditch effort to ensure that the designated IP is correctly set on the VM. If at this point the IP in VMware still does not reflect the static IP assigned to the machine, it will reboot in an attempt to ensure the IP is correctly set.

When I see this happen, it usually means there is some other network- or domain-related issue happening with the box, but I can't recall the specifics.

I assume you are trying to give your VMs a static IP? Do you see that IP registered with the VM in your vSphere client with that reboot removed?

cnelson commented 9 years ago

I actually see this issue primarily with machines that use DHCP. The machine gets an address from DHCP, vmware-tools in the guest communicates it back to vSphere, this driver sees the address, then reboots the machine anyway. Relevant snippet from a run:

- IP addresses found: []
- IP addresses found: []
- IP addresses found: ["10.102.101.198", "fe80::250:56ff:fe91:215"]
- rebooting...

Looking at the code, it looks like wait_for_ip is succeeding but then the subsequent call to has_ip fails, resulting in a reboot. Any suggestions on how to debug this further?
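
To make the suspected mismatch concrete, here is a minimal sketch of how a "wait for any address" step can pass while a later "is the expected address present" check fails. This is not the driver's code: the helper names `any_ip_reported?` and `expected_ip_present?` are hypothetical, and the idea that the expected address is `nil` under DHCP is an assumption about one possible cause, not a confirmed diagnosis.

```ruby
# Hypothetical illustration only; these helpers are NOT the driver's code.

# True as soon as the guest reports any address at all.
def any_ip_reported?(reported_ips)
  !reported_ips.empty?
end

# True only when one specific address is among those reported.
def expected_ip_present?(reported_ips, expected_ip)
  !expected_ip.nil? && reported_ips.include?(expected_ip)
end

# Addresses taken from the run output above.
reported = ["10.102.101.198", "fe80::250:56ff:fe91:215"]
# Assumption: under DHCP there is no configured static address to compare against.
expected = nil

waited_ok    = any_ip_reported?(reported)
confirmed    = expected_ip_present?(reported, expected)
would_reboot = waited_ok && !confirmed

puts "wait succeeded: #{waited_ok}, check succeeded: #{confirmed}, reboot: #{would_reboot}"
```

If something like this is happening, logging both the reported addresses and the address being compared against, right before the reboot, might show where the two checks disagree.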

mwrock commented 9 years ago

Ahh ok. This makes sense, but I don't think removing the reboot is the right fix for this. The current IP waiting logic is rather flawed for DHCP. It may be best to have attempt_ip just return if using DHCP. However, that might cause the wrong IP to be recorded with the chef machine_spec. The original IP of the VM template could end up in the machine spec, which would be bad.

Fixing this would make provisioning faster, not only by eliminating the restart but also by eliminating the minutes-long wait for an IP that the machine will never get.
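
A rough sketch of that idea, under stated assumptions: the function names `ip_action` and `ip_to_record` and their arguments are hypothetical and do not come from the driver. The sketch skips the static-IP check entirely when no static address is configured, and records an address the guest actually reported rather than one left over from the template.

```ruby
# Hypothetical sketch; names and arguments are assumptions, NOT the driver's API.

# Decide what to do once the guest has reported its addresses.
# Returns :skip for DHCP, :ok when the static address is present, :reboot otherwise.
def ip_action(static_ip, reported_ips)
  # DHCP (no static address configured): there is no specific address to wait for.
  return :skip if static_ip.nil? || static_ip.empty?
  reported_ips.include?(static_ip) ? :ok : :reboot
end

# Pick the address to store for the machine.
# With a static address, record it; under DHCP, record the first reported
# IPv4 address so a stale template address is never used.
def ip_to_record(static_ip, reported_ips)
  return static_ip unless static_ip.nil? || static_ip.empty?
  reported_ips.find { |ip| !ip.include?(":") }
end

reported = ["10.102.101.198", "fe80::250:56ff:fe91:215"]
puts ip_action(nil, reported)     # DHCP case
puts ip_to_record(nil, reported)  # address that would be recorded
```

Choosing the first IPv4 address is only one possible policy; a machine with several interfaces would likely need a more careful selection.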

I just left CenturyLink and no longer have access to vsphere infrastructure to test or futz with that. That may change soon and I will be happy to help if I can.

cnelson commented 9 years ago

I'm actively working on a vSphere-based project right now, so if I can find the time to diagnose this a bit more, I'll see about a patch to make it a bit more robust.

mwrock commented 9 years ago

I'd really be interested to see if #41 fixes this. I don't have access to an environment where I can test it but if anyone wants to give it a try, 0.8.3.dev has that PR. Please report back results if you do try it.