karniemi opened 1 year ago
In a way, this might be similar to https://github.com/ansible-collections/community.vmware/issues/1820 : powering off causes a transitional period, and during that period any further actions on the virtual machine are not acting on a consistent state. Of course, this is just a hypothesis.
Here is our current playbook to delete a VM, with all the workarounds. For the hanging lock files and folders, there is a new task that deletes the datastore folder directly with Python and pyVmomi. Of course, the vmware_guest module with state=absent should be sufficient for all of this.
---
#- name: Power-off
#  hosts: snapshooted_vms
#  gather_facts: no
#  tasks:
#    - name: Power-off nodes (could be done by the "delete old VMs"-task, but it's not parallel due to github:ansible/ansible:#37254)
#      vmware_guest:
#        hostname: "{{ vcenter.hostname }}"
#        username: "{{ vcenter.username }}"
#        password: "{{ vcenter.password }}"
#        validate_certs: False
#        name: "{{ inventory_hostname }}"
#        datacenter: "{{ vcenter_datacenter }}"
#        folder: "{{ vcenter_root_folder }}/{{ vcenter_folder }}"
#        state: poweredoff
#      failed_when: False
#      delegate_to: localhost
- name: Delete VM
  hosts: snapshooted_vms
  gather_facts: no
  tasks:
    - name: find the VM uuid
      vmware_guest_info:
        hostname: "{{ vcenter.hostname }}"
        username: "{{ vcenter.username }}"
        password: "{{ vcenter.password }}"
        datacenter: "{{ vcenter_datacenter }}"
        folder: "{{ vcenter_root_folder }}/{{ vcenter_folder }}"
        name: "{{ inventory_hostname }}"
        validate_certs: no
      delegate_to: localhost
      register: vm_info
      failed_when: false
    # The power-off could be done by the "delete the VM" task, if the workarounds were not needed :(
    - name: power off
      vmware_guest:
        hostname: "{{ vcenter.hostname }}"
        username: "{{ vcenter.username }}"
        password: "{{ vcenter.password }}"
        validate_certs: False
        datacenter: "{{ vcenter_datacenter }}"
        name: # workaround for ansible/ansible:#32901
        uuid: "{{ vm_info.instance.hw_product_uuid }}"
        state: poweredoff
      when: "'instance' in vm_info and vm_info.instance.hw_product_uuid is defined"
      delegate_to: localhost
      register: poweroff
      until: poweroff is success # https://github.com/ansible-collections/community.vmware/issues/1820
      retries: 10
      delay: 5
    # Workaround for:
    # * https://github.com/ansible-collections/community.vmware/issues/1820
    # * https://github.com/ansible-collections/community.vmware/issues/1858
    # * This is only our best guess: wait for the problem noticed in 1820, i.e. for the transient hw_guest_ha_state to pass.
    # * It might also help with 1858, if it bridges the assumed transient period between the power-off and the release of the locks.
    # * Note!
    #   * As a further improvement, we could(?) wait for any tasks on the VM to finish - for example, if vMotion decided to migrate the machine at just the "wrong" moment.
    #   * We could also try pinning the VM to a host, to keep vMotion from intervening - in case it causes some of the problem(s).
    #   * For 1858, we first tried waiting for the ".lck-" files to disappear, both with Ansible modules and via the "/folder/" URL of vCenter. Unfortunately, dot-files are not listed by those APIs :(.
    #   * This step waits only if vSphere HA is enabled!
    #     ...so this task does not wait in our PET env, which does not have vSphere HA. But that's OK: 1) 1820 is triggered by vSphere HA, so we should not see it in the PET env, and 2) the PET env does not use NFS3 datastores, which trigger 1858.
    - name: wait for the power-off transient state to complete(?)
      vmware_guest_info:
        hostname: "{{ vcenter.hostname }}"
        username: "{{ vcenter.username }}"
        password: "{{ vcenter.password }}"
        datacenter: "{{ vcenter_datacenter }}"
        folder: "{{ vcenter_root_folder }}/{{ vcenter_folder }}"
        name: "{{ inventory_hostname }}"
        validate_certs: no
      when: "'instance' in vm_info and vm_info.instance.hw_product_uuid is defined"
      delegate_to: localhost
      register: state
      until: "state.instance.hw_guest_ha_state is none"
      retries: 10
      delay: 5
#- name: Delete VM
#  hosts: snapshooted_vms
#  gather_facts: no
#  serial: 1 # workaround for github:ansible/ansible:#37254
#  tasks:
    - name: delete the VM
      vmware_guest:
        hostname: "{{ vcenter.hostname }}"
        username: "{{ vcenter.username }}"
        password: "{{ vcenter.password }}"
        validate_certs: False
        datacenter: "{{ vcenter_datacenter }}"
        # folder: "{{ vcenter_root_folder }}/{{ vcenter_folder }}"
        # name: "{{ inventory_hostname }}"
        name: # workaround for ansible/ansible:#32901
        uuid: "{{ vm_info.instance.hw_product_uuid }}"
        state: absent
        force: yes # i.e. power off before delete, github:ansible/ansible:#37000
      when: "'instance' in vm_info and vm_info.instance.hw_product_uuid is defined"
      delegate_to: localhost
    # Workaround for S17SD-5545 and https://github.com/ansible-collections/community.vmware/issues/1858
    # * Delete the VM folder from the datastore. fileManager.DeleteFile() is recursive, so all the hanging ".lck-xxx" files are removed too.
    # * Unfortunately, there is no REST API in vCenter - nor an Ansible module - to do this. And well... of course "vmware_guest: state=absent" should already do all of this.
    - name: delete the hanging locks and datastore folder (brute force, workaround for S17SD-5545)
      command: /usr/bin/python2
      args:
        stdin: |
          from pyVim.task import WaitForTask
          from pyVim.connect import SmartConnectNoSSL, Disconnect
          from pyVmomi import vim
          import atexit

          def get_obj(content, vimtype, name):
              """
              Return an object by name; if name is None, the
              first object found is returned.
              """
              obj = None
              container = content.viewManager.CreateContainerView(
                  content.rootFolder, vimtype, True)
              for c in container.view:
                  if name:
                      if c.name == name:
                          obj = c
                          break
                  else:
                      obj = c
                      break
              return obj

          serviceInstance = SmartConnectNoSSL(host="{{ vcenter.hostname }}",
                                              user="{{ vcenter.username }}",
                                              pwd="{{ vcenter.password }}",
                                              port=443)
          atexit.register(Disconnect, serviceInstance)
          content = serviceInstance.RetrieveContent()
          dc_obj = get_obj(content, [vim.Datacenter], "{{ vcenter_datacenter }}")
          try:
              # DeleteFile() is recursive: the folder and any hanging ".lck-*" files go with it
              WaitForTask(content.fileManager.DeleteFile("{{ vm_data_store_folder }}", dc_obj))
          except vim.fault.FileNotFound:
              print("{{ folder_already_absent_msg }}")
      register: delete_result
      delegate_to: localhost
      changed_when: folder_already_absent_msg not in delete_result.stdout
      vars:
        vm_data_store_folder: "{{ vm_info.instance.hw_files | select('search', inventory_hostname) | first | dirname }}"
        folder_already_absent_msg: "folder is already absent, which is OK."
      until: delete_result is success
      retries: 10
      delay: 6
      when: "'instance' in vm_info and vm_info.instance.hw_product_uuid is defined"
    # - debug:
    #     var: delete_result
    # S17SD-5545: waiting for the folder to be removed was also considered as a workaround
    # * but the problem is (probably) caused by the power-off, which (probably) leaves lock files hanging, and because of the
    #   lock files the datastore folder (probably) can't be removed when the VM is deleted
    # * ...so waiting for the datastore folder to disappear after deleting the VM was assumed to be useless - because it
    #   probably would not disappear.
    # - name: wait until the folder is removed
    #   vsphere_file:
    #     hostname: "{{ vcenter.hostname }}"
    #     username: "{{ vcenter.username }}"
    #     password: "{{ vcenter.password }}"
    #     datacenter: "{{ vcenter_datacenter }}"
    #     datastore: "{{ item }}"
    #     validate_certs: no
    #     path: "{{ inventory_hostname }}"
    #     state: file
    #   delegate_to: localhost
    #   register: datastore_folder
    #   failed_when: datastore_folder.state != 'absent'
    #   until: datastore_folder is success
    #   retries: 10
    #   delay: 6
    #   loop:
    #     - "{{ vcenter_datastore }}_01"
    #     - "{{ vcenter_datastore }}_02"
    #   when: "'instance' in vm_info and vm_info.instance.hw_product_uuid is defined"
#- name: Confirm VM deletion (vmware_guest state=absent should guarantee it, though)
#  hosts: snapshooted_vms
#  gather_facts: no
#  tasks:
#    - name: confirm/wait that VMs are really deleted (vmware_guest in the previous task *should* actually do that already...)
#      vsphere_guest:
#        vcenter_hostname: "{{ vcenter.hostname }}"
#        username: "{{ vcenter.username }}"
#        password: "{{ vcenter.password }}"
#        guest: "{{ inventory_hostname }}"
#        vmware_guest_facts: yes
#        validate_certs: no
#      register: vm_info
#      failed_when: false # this skips the critical error returned by the vsphere_guest module ...so we can actually check the "until" condition
#      until: "'No such VM' in vm_info.msg|default('')"
#      retries: 2
#      delay: 5
#      delegate_to: localhost
#    - debug:
#        var: vm_info
#    - name: fail if the VM still exists
#      fail:
#        msg: "VM was still there... or we failed to find out the status properly."
#      when: "'No such VM' not in vm_info.msg|default('')"
SUMMARY
We recently noticed that our datastores are polluted with folders like my-vm, my-vm_1, my-vm_2, etc. - that is, the virtual machine name with an underscore-and-number suffix added to it. Only the most recent folder contains the actual files for the virtual machine; the older folders contain only VMware lock files like ".lck-0f0xxxxxxxxx".
Info about the VMware lock files:
We are using the vmware_guest module to create and delete virtual machines in continuous-integration rounds. Only some of the test rounds leave these leftovers behind, so the leakage is caused by some sort of race condition.
We strongly suspect that deleting the virtual machine causes this:
The module powers off the virtual machine and then deletes it. We suspect that the power-off has not fully completed - so not all locks have been released - when the delete already starts, and that this leaves the delete incomplete.
After the delete, the virtual machines are recreated, and because a folder with the virtual machine's name (containing one lock file) already exists, VMware creates a new folder with an "_<number>" suffix.
The problem does not cause any immediate failure for any of the operations - but the folders and lock files are making the datastore messy.
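To get an idea of how widespread the pollution is, the suffixed folders can be listed with a few lines of pyVmomi. This is only an illustrative sketch (assuming an already-connected service instance and a datastore managed object "ds"; the function name is ours, not from the playbook above):

import re
from pyVim.task import WaitForTask
from pyVmomi import vim

def find_suffixed_folders(ds):
    # Search only for folders at the datastore root.
    spec = vim.host.DatastoreBrowser.SearchSpec(
        query=[vim.host.DatastoreBrowser.FolderQuery()])
    task = ds.browser.SearchDatastore_Task(
        datastorePath="[%s]" % ds.name, searchSpec=spec)
    WaitForTask(task)
    # Folder names like "my-vm_1", "my-vm_2" are suspected leftovers.
    return [f.path for f in (task.info.result.file or [])
            if re.search(r"_\d+$", f.path)]

Anything the sketch returns can then be cross-checked against the VM inventory before cleaning it up with fileManager.DeleteFile(), as in the playbook above.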
ISSUE TYPE
Bug Report
COMPONENT NAME
vmware_guest
ANSIBLE VERSION
COLLECTION VERSION
CONFIGURATION
OS / ENVIRONMENT
vCenter 8
python2-pyvmomi-7.0.1-2.el7
STEPS TO REPRODUCE
We are running tens of builds per day, and each build executes Ansible modules tens or hundreds of times. We have no exact steps to reproduce because the problem is sporadic and only happens sometimes. We already have tens or hundreds of these leftover folders in our datastore.
EXPECTED RESULTS
vmware_guest should handle the power-off and the delete so that no leftovers remain in the datastore for the deleted VM.
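For illustration, the sequencing we expect is roughly the following - a minimal pyVmomi sketch of the idea only, not the module's actual implementation: block until the power-off task has really finished (so the locks can be released) before the destroy starts.

from pyVim.task import WaitForTask
from pyVmomi import vim

def delete_vm(vm):
    # Power off first and wait for the power-off task to complete,
    # so the ".lck-" files can be released before anything else happens.
    if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOn:
        WaitForTask(vm.PowerOffVM_Task())
    # Only then destroy the VM; with no locks left, the destroy should
    # also remove the VM's datastore folder, leaving nothing behind.
    WaitForTask(vm.Destroy_Task())

Whether an additional wait is needed between those two steps (for the HA state or for the NFS3 locks) is exactly what our workarounds above are guessing at.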
ACTUAL RESULTS
After deleting the VM, the virtual machine folder is sometimes left behind in the datastore with a lock file in it.