ansible-collections / community.vmware

Ansible Collection for VMware
GNU General Public License v3.0

orphan lockfiles and folders pile up in datastore #1858

Open karniemi opened 1 year ago

karniemi commented 1 year ago
SUMMARY

We recently noticed that our datastores are polluted with folders like my-vm, my-vm_1, my-vm_2, etc. That is, folders named after the virtual machine with an underscore-and-number suffix appended. Only the most recent folder contains the actual files for the virtual machine; the older folders contain only VMware lock files like ".lck-0f0xxxxxxxxx".

Info about the vmware lock files:

We are using the vmware_guest module to create and delete virtual machines in continuous integration rounds. Only some of the test rounds leave these left-overs behind, so the leakage appears to be caused by some sort of race condition.

We strongly suspect that deleting the virtual machine is causing this:

    - name: delete the VM
      vmware_guest:
        hostname: "{{ vcenter.hostname }}"
        username: "{{ vcenter.username }}"
        password: "{{ vcenter.password }}"
        validate_certs: False
        datacenter: "{{ vcenter_datacenter  }}"
#        folder: "{{ vcenter_root_folder }}/{{ vcenter_folder }}"
#        name: "{{ inventory_hostname }}"
        name: # workaround for ansible/ansible:#32901 
        uuid: "{{ result.instance.hw_product_uuid }}"
        state: absent
        force: yes #ie. poweroff before delete..github:ansible/ansible:#37000
      when: "'instance' in result and result.instance.hw_product_uuid is defined"
      delegate_to: localhost

The module powers off the virtual machine and then deletes it. We suspect that the delete starts before the power-off has properly completed and released all the locks, and that this leaves the delete incomplete.

After the delete, the virtual machines are recreated. Since a folder with the virtual machine's name (containing one leftover lock file) already exists, VMware creates a new folder with an "_<number>" suffix.

The problem does not cause any immediate failure for any of the operations - but the orphaned folders and lock files are making the datastore messy.
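
For scale: one way to see how many of these orphans have piled up is to list the top-level datastore folders whose names end in an "_<number>" suffix. Below is a minimal pyVmomi sketch of that, only an illustration and not part of our playbooks; the vCenter address, credentials and the datastore name "datastore1" are placeholders.

import re
import atexit
from pyVim.connect import SmartConnectNoSSL, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

# placeholders: vCenter address, credentials and datastore name
si = SmartConnectNoSSL(host="vcenter.example.com", user="user", pwd="pass", port=443)
atexit.register(Disconnect, si)
content = si.RetrieveContent()

# find the datastore object by name
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
datastore = [ds for ds in view.view if ds.name == "datastore1"][0]

# browse the datastore root; with no query set the search returns the entries at that level
spec = vim.host.DatastoreBrowser.SearchSpec()
task = datastore.browser.SearchDatastore_Task(datastorePath="[%s]" % datastore.name,
                                              searchSpec=spec)
WaitForTask(task)

# entries named "<something>_<number>" are candidates for the orphan folders described above
suffix = re.compile(r"_\d+$")
for entry in task.info.result.file:
    if suffix.search(entry.path):
        print("possible orphan folder: %s" % entry.path)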

ISSUE TYPE

Bug Report

COMPONENT NAME

vmware_guest

ANSIBLE VERSION
[root@01fb50763f87 /]# ansible --version
ansible 2.9.27
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Nov 20 2015, 02:00:19) [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]
COLLECTION VERSION
CONFIGURATION
OS / ENVIRONMENT

vCenter 8, python2-pyvmomi-7.0.1-2.el7

STEPS TO REPRODUCE

We are running tens of builds per day, and each build executes Ansible modules tens or hundreds of times. We have no exact steps to reproduce because the problem is sporadic and only happens sometimes. We already have tens or hundreds of these left-over folders in our datastore.

EXPECTED RESULTS

vmware_guest should handle the power-off and delete so that no leftovers for the deleted VM remain in the datastore.

ACTUAL RESULTS

After deleting the VM, the virtual machine folder is sometimes left in the datastore with a lock file in it.

karniemi commented 1 year ago

In a way, this might be similar to https://github.com/ansible-collections/community.vmware/issues/1820 : powering off causes a transition period, and during that period further actions on the virtual machine do not act on a consistent state. Of course, this is just a hypothesis.
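
If that hypothesis holds, the workaround (or a fix inside the module) would be to wait after the power-off until the VM really reports poweredOff and has no tasks still running against it, and only then start the delete. Roughly along these lines in pyVmomi - just a sketch of the idea, not verified to actually prevent the leftover lock files; the connection details and the UUID are placeholders:

import time
import atexit
from pyVim.connect import SmartConnectNoSSL, Disconnect
from pyVim.task import WaitForTask

# placeholders: vCenter address, credentials and the VM's hw_product_uuid
si = SmartConnectNoSSL(host="vcenter.example.com", user="user", pwd="pass", port=443)
atexit.register(Disconnect, si)
content = si.RetrieveContent()
vm = content.searchIndex.FindByUuid(None, "42xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", True, False)

# power off, then wait until the power state has settled and no tasks are still
# running against the VM, hoping that by then the host has released the .lck files
if vm.runtime.powerState != "poweredOff":
    WaitForTask(vm.PowerOffVM_Task())
for _ in range(30):
    busy = [t for t in vm.recentTask if t.info.state in ("queued", "running")]
    if vm.runtime.powerState == "poweredOff" and not busy:
        break
    time.sleep(5)

# only now delete the VM
WaitForTask(vm.Destroy_Task())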

karniemi commented 1 year ago

And here is our current playbook to delete a VM, with all the workarounds. For the hanging lock files and folders, there is a new task that deletes the datastore folder directly with Python and pyVmomi. Of course, the "vmware_guest" module with "state=absent" should be sufficient for all of this.

---
#- name: Power-off 
#  hosts: snapshooted_vms 
#  gather_facts: no
#  tasks:
#    - name: Power-off nodes (could be done by the "delete old VMs"-task, but it's not parallel due to github:ansible/ansible:#37254)
#      vmware_guest:
#        hostname: "{{ vcenter.hostname }}"
#        username: "{{ vcenter.username }}"
#        password: "{{ vcenter.password }}"
#        validate_certs: False
#        name: "{{ inventory_hostname }}"
#        datacenter: "{{ vcenter_datacenter  }}"
#        folder: "{{ vcenter_root_folder }}/{{ vcenter_folder }}"
#        state: poweredoff
#      failed_when: False
#      delegate_to: localhost

- name: Delete VM
  hosts: snapshooted_vms 
  gather_facts: no
  tasks:
    - name: find the VM uuid
      vmware_guest_info:
        hostname: "{{ vcenter.hostname }}"
        username: "{{ vcenter.username }}"
        password: "{{ vcenter.password }}"
        datacenter: "{{ vcenter_datacenter  }}"
        folder: "{{ vcenter_root_folder }}/{{ vcenter_folder }}"
        name: "{{ inventory_hostname }}"
        validate_certs: no
      delegate_to: localhost
      register: vm_info
      failed_when: false

    # Power off could be done by the "delete the VM"-task, if the work-arounds were not needed :(
    - name: power off
      vmware_guest:
        hostname: "{{ vcenter.hostname }}"
        username: "{{ vcenter.username }}"
        password: "{{ vcenter.password }}"
        validate_certs: False
        datacenter: "{{ vcenter_datacenter  }}"
        name: # workaround for ansible/ansible:#32901
        uuid: "{{ vm_info.instance.hw_product_uuid }}"
        state: poweredoff
      when: "'instance' in vm_info and vm_info.instance.hw_product_uuid is defined"
      delegate_to: localhost
      register: poweroff
      until: poweroff is success # https://github.com/ansible-collections/community.vmware/issues/1820
      retries: 10
      delay: 5

    # Workaround
    # * https://github.com/ansible-collections/community.vmware/issues/1820
    # * https://github.com/ansible-collections/community.vmware/issues/1858
    # * this is only a best guess: wait for the problem noticed in 1820, i.e. for the hw_guest_ha_state transient state to pass
    #   * but this might also help 1858, if it bypasses the assumed transient period of powering off and releasing the locks
    # * Note!
    #   * As a further improvement
    #     * could(?) wait for any tasks on the VM to finish. For example, if vMotion decided to migrate the machine at just the "wrong" moment.
    #     * could try pinning the VM to a host, to prevent vMotion from intervening - in case it causes some of the problem(s)
    #   * for 1858, we first tried waiting for the ".lck-" files to disappear, using Ansible modules and also the "/folder/" URL of vCenter. Unfortunately, dot-files are not listed via those APIs :(.
    #   * This step waits only if vSphere HA is enabled!
    #     * so this task does not wait in our PET-env, which does not have vSphere HA. But that's OK: 1) 1820 is triggered by vSphere HA, so in the PET-env we should not see it 2) the PET-env does not use NFS3 datastores, which trigger 1858.
    - name: wait for the power-off transient state to complete(?)
      vmware_guest_info:
        hostname: "{{ vcenter.hostname }}"
        username: "{{ vcenter.username }}"
        password: "{{ vcenter.password }}"
        datacenter: "{{ vcenter_datacenter  }}"
        folder: "{{ vcenter_root_folder }}/{{ vcenter_folder }}"
        name: "{{ inventory_hostname }}"
        validate_certs: no
      when: "'instance' in vm_info and vm_info.instance.hw_product_uuid is defined"
      delegate_to: localhost
      register: state
      until: "state.instance.hw_guest_ha_state is none"
      retries: 10
      delay: 5

#- name: Delete VM
#  hosts: snapshooted_vms 
#  gather_facts: no
#  serial: 1 #workaround for github:ansible/ansible:#37254
#  tasks:
    - name: delete the VM
      vmware_guest:
        hostname: "{{ vcenter.hostname }}"
        username: "{{ vcenter.username }}"
        password: "{{ vcenter.password }}"
        validate_certs: False
        datacenter: "{{ vcenter_datacenter  }}"
#        folder: "{{ vcenter_root_folder }}/{{ vcenter_folder }}"
#        name: "{{ inventory_hostname }}"
        name: # workaround for ansible/ansible:#32901 
        uuid: "{{ vm_info.instance.hw_product_uuid }}"
        state: absent
        force: yes #ie. poweroff before delete..github:ansible/ansible:#37000
      when: "'instance' in vm_info and vm_info.instance.hw_product_uuid is defined"
      delegate_to: localhost

    # work-around for S17SD-5545 and https://github.com/ansible-collections/community.vmware/issues/1858
    # * delete the VM folder from the datastore. fileManager.DeleteFile() is recursive, so all the hanging ".lck-xxx" files are removed too
    # * unfortunately, there is no vCenter REST API nor Ansible module to do this. And well ... of course "vmware_guest: state=absent" should already do all of this.
    - name: delete the hanging locks and the datastore folder (brute force, workaround for S17SD-5545)
      command: /usr/bin/python2
      args:
        stdin: |
          from pyVim.task import WaitForTask
          from pyVim.connect import SmartConnectNoSSL,Disconnect
          from pyVmomi import vim
          import atexit

          def get_obj(content, vimtype, name):
              """
              Return an object by name, if name is None the
              first found object is returned
              """
              obj = None
              container = content.viewManager.CreateContainerView(
                  content.rootFolder, vimtype, True)
              for c in container.view:
                  if name:
                      if c.name == name:
                          obj = c
                          break
                  else:
                      obj = c
                      break

              return obj

          serviceInstance = SmartConnectNoSSL(host="{{ vcenter.hostname }}",
                                              user="{{ vcenter.username }}",
                                              pwd="{{ vcenter.password }}",
                                              port=443)
          atexit.register(Disconnect, serviceInstance)
          content = serviceInstance.RetrieveContent()
          dc_obj = get_obj(content, [vim.Datacenter], "{{ vcenter_datacenter  }}")

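          # A vim.fault.FileNotFound below means the folder is already gone; print the
          # marker message so the changed_when/until handling treats that as "not changed".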
          try:
              WaitForTask(content.fileManager.DeleteFile("{{ vm_data_store_folder }}", dc_obj))
          except vim.fault.FileNotFound as e:
              print ( "{{ folder_already_absent_msg }}" )
      register: delete_result
      delegate_to: localhost
      changed_when: folder_already_absent_msg not in delete_result.stdout
      vars:
        vm_data_store_folder: "{{ vm_info.instance.hw_files | select('search', inventory_hostname ) | first | dirname }}"
        folder_already_absent_msg: "folder is already absent, which is OK."

      until: delete_result is success
      retries: 10
      delay: 6

      when: "'instance' in vm_info and vm_info.instance.hw_product_uuid is defined"

    # - debug:
    #     var: delete_result

    # S17SD-5545: Waiting for the folder to be removed was considered one option as a work-around
    # * but the problem is (probably) caused by the power-off, which (probably) leaves lock files hanging, and because of the lock files the datastore folder (probably) can't be removed when the VM is deleted.
    # * ...so waiting for the datastore folder to disappear after deleting the VM was assumed to be useless - because it probably would not disappear.
    # - name: wait until the folder is removed
    #   vsphere_file:
    #     hostname: "{{ vcenter.hostname }}"
    #     username: "{{ vcenter.username }}"
    #     password: "{{ vcenter.password }}"
    #     datacenter: "{{ vcenter_datacenter  }}"
    #     datastore: "{{ item }}"
    #     validate_certs: no
    #     path: "{{ inventory_hostname }}"
    #     state: file
    #   delegate_to: localhost

    #   register: datastore_folder
    #   failed_when: datastore_folder.state != 'absent'

    #   until: datastore_folder is success
    #   retries: 10
    #   delay: 6

    #   loop:
    #     - "{{ vcenter_datastore }}_01"
    #     - "{{ vcenter_datastore }}_02"

    #   when: "'instance' in vm_info and vm_info.instance.hw_product_uuid is defined"

#- name: Confirm VM deletion (vmware_guest:absent should guarantee it, though )
#  hosts: snapshooted_vms 
#  gather_facts: no
#  tasks:
#    - name: confirm/wait that VMs are really deleted (vmware_guest in the prev task *should* actually do that already...)
#      vsphere_guest:
#        vcenter_hostname: "{{ vcenter.hostname }}"
#        username: "{{ vcenter.username }}"
#        password: "{{ vcenter.password }}"
#        guest: "{{ inventory_hostname }}"
#        vmware_guest_facts: yes
#        validate_certs: no
#      register: vm_info
#      failed_when: false # this skips the crit error returned by the vsphere_guest module ...so we can actually check the "until"-condition
#      until: "'No such VM' in vm_info.msg|default('')"
#      retries: 2
#      delay: 5
#      delegate_to: localhost
#    - debug:
#        var: vm_info
#    - name: fail if the VM still exists
#      fail:
#        msg: "VM was still there...or we failed to find out the status properly."
#      when: "'No such VM' not in vm_info.msg|default('')"