ansible-collections / community.vmware

Ansible Collection for VMware
GNU General Public License v3.0

Linux deployment from template by Ansible not working. Network in disconnected state. #1991

Open Vibhanshuj49 opened 9 months ago

Vibhanshuj49 commented 9 months ago

Hi,

I have been trying to deploy a Linux VM from a template with the playbook below. The deployment itself succeeds, but the VM does not connect to the network and the NIC stays in the disconnected state. On the VM console I see the message "A start job is running for wait for Network to be configured", which times out after a while.

While troubleshooting we tried the following, but nothing worked. Any suggestions on how to fix this?

  1. Tried with Ubuntu and CentOS images.
  2. Upgraded the Ansible version.
  3. Tried with the perl and curl packages.

aydinguven commented 9 months ago

Can you try installing VMware Tools or open-vm-tools on the template?

MaximilianClemens commented 9 months ago

Do you use distributed vSwitches and standard vSwitches in parallel? I have noticed problems when port groups with the same name exist on different switches.

FireHelmet commented 9 months ago

Hello,

I encounter the same problem. My VM runs Rocky Linux and VMware Tools is already pre-installed in my template.

I tried with and without the options connected: true and start_connected: true.

I don't have the issue with Windows templates. All of my VMs use the same vSwitch, which is a distributed vSwitch.

I use version 4.1.0 of the collection and ansible-core 2.16.3.

Thank you for your support.

FireHelmet commented 8 months ago

Hi @mariolenz ,

May I request your help, please?

Thank you!

ihumster commented 8 months ago

@FireHelmet Try some simple experiments to determine the scope of the problem:

  1. Create a standard vSwitch (VSS) on the same host, without uplinks, and create a port group with a simple name on it (for example "test_pg").
  2. Run your playbook with a task based on the community.vmware.vmware_guest module, with a specific host, switch and port group selected.

If the VM is created from the template and connects to the network, the problem is probably not in the template and, PROBABLY, not in the module. To rule the module out (or confirm it as the cause), it may make sense to create a separate port group in your VDS for further testing.

FireHelmet commented 8 months ago

Hello @ihumster ,

Thank you for your quick answer.

I tested what you proposed:

  1. Created a new port group named 'test' on a standard vSwitch on one of my hosts in the same cluster.
  2. Deployed a VM from the same template with the same playbook, except that I added the key for the ESXi host where the new port group was created and set the name of this new port group.
  3. The VM was deployed correctly and assigned to the new port group, but the network card is still disconnected.

Please find the evidence below:

image

image

Please find below the playbook I used for the deployment and the result:

- hosts: all
  gather_facts: yes
  connection: local
  environment:
    VMWARE_VALIDATE_CERTS: false

  tasks:
    - name: Deploy new virtual machine from template '${option.vm-tpl}'
      community.vmware.vmware_guest:
        hostname: MY_VCENTER
        username: MY_USER
        password: MY_PASSWORD
        datacenter: MY_DC
        esxi_hostname: MY_ESX_HOST
        folder: /
        state: poweredon
        name: FROT998
        template: TPL-Linux_Rocky_9.3
        datastore: DS003
        hardware:
          memory_mb: 4096
          num_cpus: 2
          num_cpu_cores_per_socket: 1
          version: 20
        networks:
          - name: test
            connected: true
            start_connected: true
      delegate_to: localhost

Thank you for your support,

ihumster commented 8 months ago

@FireHelmet Hmm. Why is the "Connect At Power On" checkbox disabled? Does the template have the same setting? If so, then this is the cause of the error.

FireHelmet commented 8 months ago

I don't know why this checkbox is disabled on the Linux VM. I don't have the issue with Windows deployments.

No, the VM used as the template has the checkbox enabled. I converted the template back to a VM to show you this setting, because it can't be viewed or changed while the VM is a template:

image

ihumster commented 8 months ago

I found a similar issue on the VMware forums. Take a look at this comment: https://communities.vmware.com/t5/vCenter-Server-Discussions/deployed-VM-from-template-but-NIC-is-disconnected/m-p/2977937/highlight/true#M94606

FireHelmet commented 8 months ago

Thanks @ihumster ,

I tried to add ethernet0.startConnected = "TRUE", but the option disappears after converting the VM back to a template, even though the checkbox is still enabled when I convert it to a VM again.

Also, I'm not using any customization template.

As a last test, I deployed my template by hand from vCenter, and there the checkbox is enabled as expected, so I think the problem comes from the Ansible collection or pyvmomi.

See the screenshots of the manual deployment of FROT997 below:

image

image

What's your opinion?

Many thanks for your support

ihumster commented 8 months ago

@FireHelmet One more thing to check: can you try adding the allow_guest_control property to the networks section of your playbook and setting it to false?

        networks:
          - name: test
            connected: true
            start_connected: true
            allow_guest_control: false

FireHelmet commented 8 months ago

@ihumster ,

Still disconnected,

image

The playbook I used,

- hosts: all
  gather_facts: yes
  connection: local
  environment:
    VMWARE_VALIDATE_CERTS: false

  tasks:
    - name: Deploy new virtual machine from template '${option.vm-tpl}'
      community.vmware.vmware_guest:
        hostname: MY_VCENTER
        username: MY_USER
        password: MY_PASSWORD
        datacenter: MY_DC
        esxi_hostname: MY_ESX_HOST
        folder: /
        state: poweredon
        name: FROT998
        template: TPL-Linux_Rocky_9.3
        datastore: DS008
        hardware:
          memory_mb: 4096
          num_cpus: 2
          num_cpu_cores_per_socket: 1
          version: 20
        networks:
          - name: test
            connected: true
            start_connected: true
            allow_guest_control: false
      delegate_to: localhost

MaximilianClemens commented 8 months ago

Hello @FireHelmet,

Can you check whether the deployed VM has events like "customization started" (in the UI under Monitor > Events)? Can you also log in to the console of the deployed VM and check /var/log for cloud-init logs generated after the cloning?

And as a last test, can you try just this:

        networks:
          - name: SRV-LAN

That is, remove connected, start_connected and allow_guest_control. When those settings are set, it seems the module does this:

if nic_change_detected:
    # Change to fix the issue found while configuring opaque network
    # VMs cloned from a template with opaque network will get disconnected
    # Replacing deprecated config parameter with relocation Spec
    if isinstance(net_obj, vim.OpaqueNetwork):
        self.relospec.deviceChange.append(nic)
    else:
        self.configspec.deviceChange.append(nic)
    self.change_detected = True

Maybe that is related to this issue?

Regards, Maximilian

FireHelmet commented 8 months ago

Hello @MaximilianClemens ,

Yes, please see below, on a newly created VM from the same template with the same settings except for the name:

image

Also, there is no cloud-init log because I don't use the customization feature of vSphere:

image

As for the test with only - name: SRV-LAN, I had already tried that, with the same result, unfortunately.

What's an "opaque network"?

Thank you for your support too.

MaximilianClemens commented 8 months ago

Is there anything under Events in the vSphere UI?

I don't know what an opaque network is, but my theory was that something triggers a customization even when it is not wanted, and that this customization fails. That failure would result in a disconnected adapter.

FireHelmet commented 8 months ago

@MaximilianClemens ,

No, nothing related to "customization" and as I wrote, the issue doesn't appear when I deploy the VM from the same template but manually from the vCenter UI. Also, no issues with Windows templates.

As a workaround, I currently use a PowerShell script with PowerCLI, running on a Windows host, through this Ansible playbook:

- hosts: all
  gather_facts: no
  tasks:
  - name: Connect the Network Interface of '${option.vm-name}'
    ansible.windows.win_powershell:
      script: |
        Set-PowerCLIConfiguration -InvalidCertificateAction Ignore -ParticipateInCEIP $false -Confirm:$false | out-null
        Connect-VIServer -Server ${option.vcenter-hostname} -User ${option.vcenter-username} -Password ${option.vcenter-password}
        Get-VM ${option.vm-name} | Get-NetworkAdapter | Set-NetworkAdapter -StartConnected:$true -Connected:$true -Confirm:$false
        Disconnect-VIServer -Server ${option.vcenter-hostname} -Confirm:$false

I use Rundeck on top of Ansible, which is why the variables have the format ${option.xxx}.
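
For setups without a Windows host for PowerCLI, the same reconnect step might be expressed in Ansible itself. The following is a minimal, untested sketch. It assumes the community.vmware.vmware_guest_network module from the same collection, that the cloned VM has a single adapter with the label "Network adapter 1", and that the adapter is attached to the 'test' port group used earlier in this thread. The MY_* values are placeholders, and the parameter names should be checked against the module documentation (ansible-doc community.vmware.vmware_guest_network). Whether this avoids the disconnected state discussed here has not been confirmed in this thread.

```yaml
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Connect the first network adapter of the deployed VM
      community.vmware.vmware_guest_network:
        hostname: MY_VCENTER
        username: MY_USER
        password: MY_PASSWORD
        validate_certs: false
        datacenter: MY_DC
        name: FROT998
        # Assumption: label of the only NIC on the cloned VM
        label: "Network adapter 1"
        # Assumption: port group the NIC is already attached to
        network_name: test
        state: present
        connected: true
        start_connected: true
```

Like the PowerCLI script, this runs as a separate step after the VM has been deployed, so it reconfigures an existing adapter instead of relying on the clone operation to set the flags.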

ihumster commented 8 months ago

@MaximilianClemens FYI, "opaque network" is a term from NSX-T. It is used for port groups that the NSX-T manager creates on top of an N-VDS (current NSX-T versions no longer use it, and it exists only for compatibility).

ihumster commented 8 months ago

@FireHelmet I guess we'll have to dive into some deep debugging. Please deploy a new VM from this template and send the machine log, vmware.log, from the VM's directory on the datastore (for example, post it on Pastebin).

FilipFabicevic commented 8 months ago

Quoting @FireHelmet: "I tried to add ethernet0.startConnected = "TRUE" but the option disappear after reconverting the VM to template [...]"

Where did you add this? I'd like to try it. By the way, I managed to get this working by installing cloud-init on the template, but our company does not use cloud-init in production, so I have to find a workaround.

FireHelmet commented 8 months ago

Hello @ihumster ,

Please find the log here https://pastebin.com/w13XP186 . The retention is 1 month. I sent the password for accessing this link by email to ihumster@ihumster.ru

Thank you very much for your support

FireHelmet commented 8 months ago

Hello @FilipFabicevic ,

I added the key/value to the .vmx file of the VM. But as I said, this fix doesn't work, and not only for me.

ihumster commented 8 months ago

@FireHelmet I looked at the log and didn't see anything relevant to the problem. Perhaps you need to dig further and look at the part of hostd.log from the ESXi server that covers the VM startup.

For convenience, you can evacuate all the VMs from one of the hosts (switch DRS to Manual mode), run the playbook so that it deploys the VM to this host rather than to the cluster, and watch hostd.log at the same time. Perhaps the reason will be visible there. Judging by the VM startup logs, the cause is not in the VM but in the vSphere infrastructure.

FireHelmet commented 8 months ago

@ihumster ,

Please find the hostd.log here https://pastebin.com/6qPL78hE . The retention is 1 month. I sent the password for accessing this link by email to ihumster@ihumster.ru

I just extracted all logs around the VM ID and/or the name of the VM. I hope this log will help you.

Thank you very much for your support

ihumster commented 8 months ago

@FireHelmet Please also add vmkernel.log from the host for the same time period.

FireHelmet commented 8 months ago

@ihumster

Please find the vmkernel.log here https://pastebin.com/WDYYsBuQ. The retention is 1 month. I sent the password for accessing this link by email to ihumster@ihumster.ru

I just extracted all logs around the VM ID and/or the name of the VM. I hope this log will help you.

Thank you very much for your support

ihumster commented 8 months ago

@FireHelmet Either the logs are incomplete, or they contain nothing about the VM's network adapter. You extracted the log lines around the VM ID/name, but we need more log lines about dvportgroup-66 of your dswitch.

djvujke commented 8 months ago

I have the same network names on a standard switch and on a dswitch. So when I want to change the adapter with this task:

- name:  Changing network adapter  
  vmware_guest:
    <<: *vmware_connection
    name: "{{ my_vm.vm_name }}"
    networks:
    - name: "{{ my_vm.vm_network }}"
      ip: "{{ my_vm.vm_ip_address }}"
      netmask: "{{ my_vm.vm_netmask }}"
      state: present
      start_connected: true
      connected: true
      dvswitch_name: "DSwitch" 

The task fails:

TASK [Changing network adapter] **
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to connect virtual device ethernet0. ", "op": "reconfig"}

I do set dvswitch_name to my switch's name, but I'm not sure the module really selects the network from the dvswitch. It seems to take the one from the standard switch. When I select the network manually, I pick the one from the dvswitch and the change succeeds.

vSphere 7.0.3 U3

FireHelmet commented 8 months ago

Hey @djvujke ,

Please open a new issue because it's not the same topic.

Thanks

jbertozzi commented 8 months ago

Hello,

Just came here to say that we have encountered the same issue since we migrated from vSphere 7 to vSphere 8.0.2 (build 23319993).

In reply to @ihumster's pointer to the VMware forum comment (https://communities.vmware.com/t5/vCenter-Server-Discussions/deployed-VM-from-template-but-NIC-is-disconnected/m-p/2977937/highlight/true#M94606):

We are currently testing adding the following configuration to the template:

cat /etc/vmware-tools/tools.conf
[deployPkg]
enable-custom-scripts = true
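
If the template is maintained with Ansible, this setting could be applied with a task along these lines. This is only a sketch: it assumes the community.general collection is installed and that the play targets the VM that is later converted to a template (the inventory group name linux_template is hypothetical). The file path, section and option are taken from the snippet above.

```yaml
- hosts: linux_template   # hypothetical inventory group for the template VM
  become: true
  tasks:
    - name: Enable custom scripts in the VMware Tools deployPkg section
      community.general.ini_file:
        path: /etc/vmware-tools/tools.conf
        section: deployPkg
        option: enable-custom-scripts
        value: "true"
```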

I will keep you updated.

Regards,

runejuhl commented 8 months ago

FYI I had some apparently similar issues.

I made a hacky workaround by changing the network interfaces to start disconnected when provisioning, and connecting them after the VM was created. This seemed to work nicely, and might work for you as well.

In the end it turned out that my issue was caused by the VM template not having Netplan installed, and VMware expecting it to be available and failing customization because of this. The symptoms were similar enough that I thought I had the same issue as y'all :sweat_smile: