dmacvicar / terraform-provider-libvirt

Terraform provider to provision infrastructure with Linux's KVM using libvirt
Apache License 2.0

Remove artifacts when pool or domain creation fail #897

Open fmoor opened 2 years ago

fmoor commented 2 years ago

When pool or domain creation fails, artifacts may be left behind that cause rerunning the Terraform plan against the same host to fail. This fixes a problem similar to the one described in https://github.com/dmacvicar/terraform-provider-libvirt/issues/739

System Information

Linux distribution

Ubuntu

Terraform version

$ terraform -v
Terraform v1.0.9
on linux_amd64

Provider and libvirt versions

$ terraform-provider-libvirt -version
0.6.11
dmacvicar commented 2 years ago

What artifacts are you referring to?

Do you have an example or sample output of this situation?

fmoor commented 2 years ago

The default qemu configuration on my laptop causes domain creation to fail with a Permission denied error. After fixing the configuration, terraform apply fails with:

╷
│ Error: Error defining libvirt domain: operation failed: domain 'consul_node_2' already exists with uuid 4f3dc245-706e-4065-b075-2a25a9383ee6
│
│   with module.consul-server.libvirt_domain.consul_node[2],
│   on ../modules/consul-libvirt/consul-server.tf line 68, in resource "libvirt_domain" "consul_node":
│   68: resource "libvirt_domain" "consul_node" {
│
╵
╷
│ Error: Error defining libvirt domain: operation failed: domain 'consul_node_0' already exists with uuid cb944a87-3d95-415e-ac8c-f5a70cf4cb12
│
│   with module.consul-server.libvirt_domain.consul_node[0],
│   on ../modules/consul-libvirt/consul-server.tf line 68, in resource "libvirt_domain" "consul_node":
│   68: resource "libvirt_domain" "consul_node" {
│
╵
╷
│ Error: Error defining libvirt domain: operation failed: domain 'consul_node_1' already exists with uuid 3c5c5033-d4af-4d2c-92dd-7a55b7b4c21c
│
│   with module.consul-server.libvirt_domain.consul_node[1],
│   on ../modules/consul-libvirt/consul-server.tf line 68, in resource "libvirt_domain" "consul_node":
│   68: resource "libvirt_domain" "consul_node" {
│
╵

This is because the libvirt provider didn't clean up the domains that encountered permission errors during creation.

$ virsh list --all
 Id   Name            State
--------------------------------
 -    consul_node_0   shut off
 -    consul_node_1   shut off
 -    consul_node_2   shut off

Running terraform destroy does not delete the domains because they were never added to the terraform state. Undefining the domains using virsh and then running terraform apply works as expected.
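The manual cleanup step mentioned above looks roughly like this (the domain names are taken from the `virsh list --all` output; substitute whatever your host reports):

```shell
# Undefine the leftover domains so the next `terraform apply` can
# create them fresh. They are already shut off, so no destroy is needed.
for d in consul_node_0 consul_node_1 consul_node_2; do
    virsh undefine "$d"
done
```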

dmacvicar commented 2 years ago

So, if with this logic somebody by accident sets the same name of a running workload, creation will fail and we will both destroy and undefine this workload?

fmoor commented 2 years ago

> So, if with this logic somebody by accident sets the same name of a running workload, creation will fail and we will both destroy and undefine this workload?

Name collisions are detected earlier in the resource-creation flow (when the XML is defined). This change only cleans up when creation fails, not definition, so I don't think there is any danger of destroying or undefining something that is not managed by the current Terraform config.
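The define/create/clean-up ordering being discussed can be sketched in Go as follows. This is only an illustration of the pattern, not the provider's actual code: `defineDomain`, `createDomain`, and `undefineDomain` are hypothetical stand-ins for the libvirt calls the provider makes.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the provider's libvirt calls.
// defineDomain succeeds; createDomain simulates the permission
// failure described in this issue.
func defineDomain(name string) error { return nil }
func createDomain(name string) error { return errors.New("Permission denied") }
func undefineDomain(name string) error {
	fmt.Printf("undefined %s\n", name)
	return nil
}

// provisionDomain sketches cleanup-on-failure: a name collision would
// already have failed at the define step, so if create fails after a
// successful define, the half-created domain belongs to this run and
// can safely be undefined before returning the error.
func provisionDomain(name string) error {
	if err := defineDomain(name); err != nil {
		// Name collisions surface here; nothing to clean up.
		return err
	}
	if err := createDomain(name); err != nil {
		// Remove the artifact so a re-run does not hit
		// "domain already exists".
		if uerr := undefineDomain(name); uerr != nil {
			return fmt.Errorf("create failed (%v); cleanup also failed: %v", err, uerr)
		}
		return fmt.Errorf("create failed, domain undefined: %v", err)
	}
	return nil
}

func main() {
	if err := provisionDomain("consul_node_0"); err != nil {
		fmt.Println(err)
	}
}
```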

JSmith-Aura commented 1 year ago

I've just been hit by this as well: on a failure to attach my libvirt guest to a network, Terraform bailed out but left behind the virtual machine definition, so running terraform apply again resulted in a name collision.