canonical / microcloud

Automated private cloud based on LXD, Ceph and OVN
https://microcloud.is
GNU Affero General Public License v3.0

Cannot delete instances - image has watchers - not removing #297

Open VariableDeclared opened 2 months ago

VariableDeclared commented 2 months ago

Hello

When trying to delete instances on MicroCloud, the deletion fails with the following error:

Error: Failed deleting instance "private-repo-lds-3" in project "REDACTED_PROJECT_NAME": Error deleting storage volume: Failed to delete volume: Failed to run: rbd --id admin --cluster ceph --pool lxd_remote rm virtual-machine_REDACTED_PROJECT_NAME_private-repo-lds-3.block: exit status 16 (2024-04-29T11:02:41.760+0000 7fe2f4898640 -1 librbd::image::PreRemoveRequest: 0x5563e888b7b0 check_image_watchers: image has watchers - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
This means the image is still open or the client using it crashed. Try again after closing/unmapping it or waiting 30s for the crashed client to timeout.)

The issue occurred after deploying a set of 14 VMs with the following config: https://pastebin.canonical.com/p/DmfDtKc6cz/

The VMs were deployed on Friday and left running over the weekend. Destroying the VMs afterwards failed with the above error.

Workaround

  1. sudo ps aux | grep qemu
  2. Identify the process for your VMs
  3. sudo kill ${PID}
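
The workaround steps above can be sketched as a small shell helper. This is only a sketch: `qemu_pids_for_instance` is a hypothetical name, and it assumes the instance name appears somewhere in the QEMU command line, which I have not confirmed for every LXD version. It filters on the executable column so that the filter process itself never shows up as a match.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the PIDs of QEMU processes whose command line
# mentions the given instance name.
# Input: lines of "PID COMMAND ARGS..." on stdin, as printed by `ps -eo pid=,args=`.
# Assumption: the instance name appears somewhere in the QEMU command line.
qemu_pids_for_instance() {
    local instance="$1"
    # Match on the executable ($2) so the awk process itself is never selected,
    # then do a plain substring match for the instance name.
    awk -v inst="$instance" '$2 ~ /qemu/ && index($0, inst) > 0 { print $1 }'
}

# Usage (not run here):
#   sudo ps -eo pid=,args= | qemu_pids_for_instance private-repo-lds-3
#   sudo kill <PID>   # only after checking that the PID belongs to the crashed VM
```

Because the name is matched as a substring, `private-repo-lds-3` would also match an instance called `private-repo-lds-30`, so check the listed PIDs before killing anything.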

Peter

tomponline commented 2 months ago

The VMs were found to have crashed. Killing the qemu processes released the RBD volumes, which allowed deletion.

VariableDeclared commented 2 months ago

The steps to reproduce this:

  1. Create VMs as described
  2. Using netplan, add a new VLAN to the bond on which LXD is running its services, e.g. a VLAN with ID 55
  3. Apply VLAN changes
  4. Allow cluster to settle
  5. Attempt removal

As far as I can gather, these are the steps that have taken place since I used the environment. I still need to validate this and confirm a minimal reproducer.

Thank you Peter

VariableDeclared commented 2 months ago

Following up here: I am struggling to validate the above steps as a reproducer. I tried adding a VLAN and I do see errors from Ceph, but VM deletion is still possible.
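
When validating a reproducer, it may help to check whether the image still has watchers before attempting the deletion. A minimal sketch, assuming `rbd status` prints one line containing `watcher=` per watcher (and `Watchers: none` otherwise); `count_rbd_watchers` is a hypothetical helper name, and the `rbd` options are simply reused from the failing command in the error above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count watcher entries in `rbd status` output read from stdin.
# Assumption: each watcher is printed on a line containing "watcher=".
count_rbd_watchers() {
    awk '/watcher=/ { n++ } END { print n + 0 }'
}

# Usage (not run here), reusing the options from the failing command:
#   rbd --id admin --cluster ceph --pool lxd_remote status <image> | count_rbd_watchers
```

A non-zero count after the VLAN change would suggest the image is still held open, which is the condition that makes `rbd rm` refuse to remove it.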