equinix / terraform-equinix-metal-anthos-on-vsphere

[Deprecated] Automated Anthos Installation via Terraform for Equinix Metal with vSphere
https://registry.terraform.io/modules/equinix/anthos-on-vsphere/metal/latest
Apache License 2.0

VCVA deployment times out #119

Closed mattsday closed 2 weeks ago

mattsday commented 3 years ago

When deploying the vCenter virtual appliance, it always fails with a message like this before timing out:

null_resource.deploy_vcva (remote-exec): Error:     Problem Id: None                                                                 
null_resource.deploy_vcva (remote-exec): Component key: setnet     Detail:
null_resource.deploy_vcva (remote-exec): Failed to set the time via NTP. Details:                                                    
null_resource.deploy_vcva (remote-exec): Failed to sync to NTP servers.. Code:                                                       
null_resource.deploy_vcva (remote-exec): com.vmware.applmgmt.err_ntp_sync_failed                                                     
null_resource.deploy_vcva (remote-exec): Could not set up time synchronization.                                                      
null_resource.deploy_vcva (remote-exec): Resolution: Verify that provided ntp                                                        
null_resource.deploy_vcva (remote-exec): servers are valid.                                                                          
null_resource.deploy_vcva (remote-exec):  [FAILED] Task: MonitorDeploymentTask:                                                      
null_resource.deploy_vcva (remote-exec): Monitoring Deployment execution failed                                                      
null_resource.deploy_vcva (remote-exec): at 13:45:13                                                                                 
null_resource.deploy_vcva (remote-exec): ========================================                                                    
null_resource.deploy_vcva (remote-exec): Error message: The appliance REST API                                                       
null_resource.deploy_vcva (remote-exec): was not yet available from the target                                                       
null_resource.deploy_vcva (remote-exec): VCSA 'vcva'because 'Failed to query                                                         
null_resource.deploy_vcva (remote-exec): deployment status for appliance vcva                                                        
null_resource.deploy_vcva (remote-exec): after trying all ip addresses'. The VCSA                                                    
null_resource.deploy_vcva (remote-exec): might still be starting up.                                                                 
null_resource.deploy_vcva (remote-exec): =============== 13:45:13 ===============                                                    
null_resource.deploy_vcva (remote-exec): Result and Log File Information...                                                          
null_resource.deploy_vcva (remote-exec): WorkFlow log directory:
null_resource.deploy_vcva (remote-exec): /tmp/vcsaCliInstaller-2021-02-01-13-12-fzgl6ein/workflow_1612185129050

Details:

vCenter ISO: VMware-VCSA-all-6.7.0-15132721.iso
esxi_size = c3.medium.x86

I have worked around it by changing the VCVA appliance's NTP server to time.google.com:

sed -i 's/time\.nist\.gov/time.google.com/' templates/vcva_template.json

This is not a permanent fix, as ideally the deployment would fail at this stage instead of continuing until a timeout. I will also add a PR doing this.
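One way to get the fail-fast behaviour described above is to watch the installer output for the NTP error code shown in the log and stop as soon as it appears. The following is only a sketch under that assumption: the function name `watch_for_ntp_failure` and its exit-code convention are hypothetical and are not part of this module.

```shell
#!/bin/sh
# Hypothetical helper, not part of the module: the name and the exit-code
# convention are illustrative only.

# Error code taken verbatim from the installer log in this issue.
NTP_ERROR_CODE='com.vmware.applmgmt.err_ntp_sync_failed'

# Reads installer output on stdin and echoes each line through.
# Returns 1 as soon as a line containing the NTP error code is seen,
# and 0 if the input ends without that code appearing.
watch_for_ntp_failure() {
  while IFS= read -r line; do
    printf '%s\n' "$line"
    case "$line" in
      *"$NTP_ERROR_CODE"*)
        echo "NTP sync failed; aborting instead of waiting for the timeout" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

The installer output would be piped into `watch_for_ntp_failure`, and the calling script would treat a non-zero status as a failed deployment. Returning from the function only closes the reading end of the pipe, so the caller would likely still need to stop the installer process itself.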

displague commented 3 years ago

Do we understand why this error is happening?

https://vkernelblog.com/vcenter-failed-to-set-the-time-via-ntp-details/ suggests that port 123 may be blocked.

https://kb.vmware.com/s/article/59729 suggests that if the target NTP service does not respond to ICMP, the NTP "valid server" check will fail (prior to vCenter Server 6.7 Update 2).

I can confirm that time.nist.gov does not respond to ICMP. This sounds like the likely cause.
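A check like the one described above can be sketched as a small pre-flight test of whether a candidate NTP host answers ICMP. This is an illustration, not something the module ships: the function names and the `PING_CMD` override are hypothetical, and the `-W 2` timeout assumes Linux iputils `ping` semantics (seconds).

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of the module.
# PING_CMD is an illustrative override hook (useful for testing);
# it defaults to the system ping.

# Succeeds if the host replies to a single ICMP echo request.
# -c 1 sends one request; -W 2 waits up to 2 seconds for the reply
# (Linux iputils semantics; other platforms interpret -W differently).
answers_icmp() {
  ${PING_CMD:-ping} -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Prints a one-line verdict for the host and returns 1 when no reply arrives.
report_ntp_candidate() {
  if answers_icmp "$1"; then
    echo "$1: answers ICMP"
  else
    echo "$1: no ICMP reply; the NTP validity check before 6.7 Update 2 may reject it"
    return 1
  fi
}
```

For example, `report_ntp_candidate time.nist.gov` could be run from the host that performs the deployment before the appliance install starts. A missing reply can also mean that ICMP is filtered somewhere along the path, so a failure here is a hint rather than proof.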

displague commented 3 years ago

Closed based on #120.

Reopening for @mattsday follow-up:

> This is not a permanent fix, as ideally the deployment would fail at this stage instead of continuing until a timeout. I will also add a PR doing this.

displague commented 3 years ago

It sounds like this problem could also be solved with a documentation update requiring vCenter Server 6.7 Update 2.

mattsday commented 3 years ago

+1 on a documentation update. It would be a much better long-term solution, even if switching to time.google.com works around the issue for now.
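For anyone applying the time.google.com workaround in the meantime, the substitution could take the replacement server as a parameter instead of hard-coding it. This is a minimal sketch, assuming the template still contains the literal hostname time.nist.gov as the `sed` command earlier in this thread implies; the function name `replace_ntp_server` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of the module.
# Reads template text on stdin and writes it to stdout with every
# occurrence of time.nist.gov replaced by the hostname given as $1.
# The hostname is interpolated into the sed expression, so it must not
# contain '/' or '&' (plain DNS names are fine).
replace_ntp_server() {
  new_server="$1"
  sed "s/time\.nist\.gov/${new_server}/g"
}
```

A possible invocation is `replace_ntp_server time.google.com < templates/vcva_template.json > vcva_template.patched.json`, where the output file name is only an example. Writing to a separate file keeps the original template untouched for comparison.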