ansible-collections / community.vmware

Ansible Collection for VMware
GNU General Public License v3.0

vmware_deploy_ovf timing out for large ovf #2253

Closed agowa closed 1 week ago

agowa commented 1 week ago
SUMMARY

community.vmware.vmware_deploy_ovf fails and reports a timeout even though the OVF is still being imported and its progress is still being reported in the web interface.

ISSUE TYPE
COMPONENT NAME

community.vmware.vmware_deploy_ovf

ANSIBLE VERSION
ansible [core 2.17.2]
  config file = /workspaces/ansible-playbook/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.13/site-packages/ansible
  ansible collection location = /workspaces/ansible-playbook
  executable location = /usr/local/bin/ansible
  python version = 3.13.0 (main, Nov 12 2024, 21:21:29) [GCC 13.2.1 20240309] (/usr/local/bin/python3.13)
  jinja version = 3.1.4
  libyaml = False
COLLECTION VERSION
Collection       Version
---------------- -------
community.vmware *
CONFIGURATION

unrelated

OS / ENVIRONMENT

Docker container: python:3.13-alpine3.20 with some additional packages (unrelated to this issue)

STEPS TO REPRODUCE
  1. Deploy a large OVA, or use a very slow ESXi server (e.g. a CPU-throttled nested ESXi)
  2. Run an Ansible task like:
- name: Deploy VCSA
  delegate_to: localhost
  community.vmware.vmware_deploy_ovf:
    name: '{{ inventory_hostname }}'
    hostname: '{{ esxi_hostname }}'
    username: '{{ esxi_username }}'
    password: '{{ esxi_password }}'
    datastore: "datastore1"
    disk_provisioning: 'thin'
    networks:
      "Network 1": "VM Network"
    ovf: '{{ playbook_dir }}/files/my.ova'
    wait_for_ip_address: true
    validate_certs: '{{ validate_certs }}'
    inject_ovf_env: true
    # properties: (unrelated to issue)
EXPECTED RESULTS

The module should poll the status of the deployment task instead of relying on a generic timeout.

Alternatively, either increase the timeout significantly (by 10x) or allow a user-provided one.
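
To make the difference between the two approaches concrete, here is a minimal sketch of state-driven polling. It is not the module's code: `wait_for_task`, `get_state`, and `max_wait` are hypothetical names, and the state strings are plain placeholders chosen for illustration. The loop keeps waiting while the task reports a non-terminal state and only gives up if an optional upper bound is set.

```python
import time
from typing import Callable, Optional

# Assumption: the task reports one of these plain-string states.
TERMINAL_STATES = {"success", "error"}


def wait_for_task(get_state: Callable[[], str],
                  poll_interval: float = 2.0,
                  max_wait: Optional[float] = None,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_state() until it returns a terminal state.

    get_state is a hypothetical callable supplied by the caller.
    max_wait=None means: keep waiting as long as the task is not finished.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        # Only enforce an upper bound when the caller asked for one.
        if max_wait is not None and waited >= max_wait:
            raise TimeoutError(f"task still '{state}' after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval
```

With `max_wait=None`, a slow import would never be reported as failed while it is still making progress; a caller who wants a hard limit can pass one explicitly.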

ACTUAL RESULTS

[Screenshot attached to the original issue]

FAILED! => {"changed": false, "msg": "Waiting for IP address timed out"}

ihumster commented 1 week ago

@agowa First of all, check whether this operation takes the same time when you do it through the UI or, for example, with PowerCLI. When transferring an OVA file to ESXi, the problem may be, for example, low network speed between your device and the ESXi management interface. As one of the authors of this module, I can say that the file transfer stage is simple and primitive: problems can arise at the L2-L4 network level, but not at L7.

mariolenz commented 1 week ago

Looking at the code, I think the "Waiting for IP address timed out" error is raised after the OVF has been deployed successfully. I guess the VM doesn't get an IP address, or the module runs into a timeout before it does.

@ihumster Maybe we should add a wait_for_ip_address_timeout option like in vmware_guest:

https://github.com/ansible-collections/community.vmware/blob/324a44eb182ed6a83bc75f8d78eb6ef4598c600a/plugins/modules/vmware_guest.py#L3157-L3158

On the other hand, I think the default timeout is 5 minutes:

https://github.com/ansible-collections/community.vmware/blob/324a44eb182ed6a83bc75f8d78eb6ef4598c600a/plugins/module_utils/vmware.py#L163

So it might be another problem. But having a wait_for_ip_address_timeout wouldn't hurt, would it? Some OVFs might take more than 5 minutes to initialize completely and get an IP.
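
As a rough illustration of what such an option would control, here is a sketch of an IP wait loop with a caller-supplied timeout. It is not the collection's implementation: `wait_for_ip`, `get_ip`, and `interval` are hypothetical names, and only the 300-second default is taken from the discussion above.

```python
import time
from typing import Callable, Optional


def wait_for_ip(get_ip: Callable[[], Optional[str]],
                timeout: float = 300.0,
                interval: float = 5.0,
                clock: Callable[[], float] = time.monotonic,
                sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Return the first IP reported by get_ip(), or None once timeout elapses.

    get_ip is a hypothetical callable that returns the guest IP or None.
    timeout plays the role of the proposed wait_for_ip_address_timeout value.
    """
    deadline = clock() + timeout
    while True:
        ip = get_ip()
        if ip:
            return ip
        if clock() >= deadline:
            return None  # the caller decides how to report the timeout
        sleep(interval)
```

A slow appliance could then be handled by passing, say, `timeout=1800` instead of changing the default for everyone.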

ihumster commented 1 week ago

@mariolenz 5 minutes (300 seconds by default) is a perfectly normal amount of time for any VM to start and initialize its network subsystem with an IP address (especially for OVF appliances). If the VM does not get an IP address within this time, maybe something is wrong with the VM or with the network connectivity between the controller machine and the new VM?

ihumster commented 1 week ago

@agowa After you got the "Waiting for IP address timed out" error message from Ansible, did you try to ping the VM's IP from your Ansible controller?

agowa commented 1 week ago

Considering the above I tried it again and this time it worked just fine even though it took a while. Maybe some kind of network fluke that caused it to skip some check?

The strange thing was that (as you can see in the initial post) the OVA was still deploying while Ansible had already failed in the "waiting for IP address" stage, which it shouldn't even have reached yet if I understand you correctly.

And no, I didn't try to ping the IP when it failed, as it still took a while for the OVA to even finish being deployed. And only after that it was powered on.

ihumster commented 1 week ago

@agowa In this case, it looks like the connection has been lost.

agowa commented 1 week ago

That's very interesting, as the OVA was being uploaded from the Ansible controller. Do you know how it could still be uploading even though Ansible had already terminated?

Anyway, this appears to be some very interesting edge case then...

ihumster commented 1 week ago

To keep this issue open, you need to provide a working way to reproduce it.

agowa commented 1 week ago

It works now. Must have been some fluke. I'll reopen the ticket should I find something reproducible.