dmacvicar / terraform-provider-libvirt

Terraform provider to provision infrastructure with Linux's KVM using libvirt
Apache License 2.0

"wait_for_lease = true" is broken in versions greater than 0.7.1 #1091

Open SJFCS opened 3 months ago

SJFCS commented 3 months ago

System Information

Linux distribution

Archlinux

Terraform version

terraform -v
Terraform v1.9.4
on linux_amd64

Provider and libvirt versions

This issue can be reproduced in versions greater than 0.7.1. Tested versions: 0.7.1 (normal), 0.7.4 (broken), 0.7.6 (broken), 0.8.1 (broken).

Description of Issue/Question

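With any version greater than 0.7.1, the provider does not wait for the domain to obtain an IP address. The "Still creating... [10s elapsed]" progress message never appears, and the apply fails with the following error: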

│ Error: couldn't retrieve IP address of domain id: 3ac397de-13cd-485d-9772-872f7652de0d. Please check following: 
│ 1) is the domain running proplerly? 
│ 2) has the network interface an IP address? 
│ 3) Networking issues on your libvirt setup? 
│  4) is DHCP enabled on this Domain's network? 
│ 5) if you use bridge network, the domain should have the pkg qemu-agent installed 
│ IMPORTANT: This error is not a terraform libvirt-provider error, but an error caused by your KVM/libvirt infrastructure configuration/setup 
│  error retrieving interface addresses: error retrieving interface addresses: Virtual machine agent not responding: QEMU host agent not connected
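
The final line of the error reports that the QEMU guest agent was not connected at the moment the provider queried it. As a minimal sketch for checking this by hand (not the provider's own logic), the helper below polls the agent with guest-ping through virsh. It assumes virsh is on the PATH and that the domain is named talos; wait_for_agent and its default retry values are hypothetical, chosen only for illustration.

```shell
# wait_for_agent DOMAIN [TRIES] [DELAY_SECONDS]
# Hypothetical helper: polls the QEMU guest agent with guest-ping until it answers.
# Returns 0 once the agent responds, 1 if it never does within TRIES attempts.
wait_for_agent() {
  wfa_domain=$1
  wfa_tries=${2:-30}
  wfa_delay=${3:-10}
  wfa_i=1
  while [ "$wfa_i" -le "$wfa_tries" ]; do
    # guest-ping succeeds only when the agent inside the guest is up and connected.
    if virsh qemu-agent-command "$wfa_domain" --timeout 5 '{"execute":"guest-ping"}' >/dev/null 2>&1; then
      echo "agent responded on attempt $wfa_i"
      return 0
    fi
    sleep "$wfa_delay"
    wfa_i=$((wfa_i + 1))
  done
  echo "agent did not respond after $wfa_tries attempts" >&2
  return 1
}

# Example: wait_for_agent talos 30 10
```

If the agent only starts answering some time after boot, that would be consistent with a provider version that keeps retrying succeeding, and one that gives up on the first agent error failing.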

I found that this is not related to whether the network mode is bridge or nat. To simplify reproduction and rule out cloud-init interference, I used the Talos ISO boot image below, which includes qemu-guest-agent and can be booted directly as a boot disk.

The metal-amd64.iso (MD5: ebd98e402606991700d8cb5545e72673) can be downloaded from: https://factory.talos.dev/image/ce4c980550dd2ab1b17bbf2b08801c7eb59418eafe8f279833297925d67c7515/v1.8.2/metal-amd64.iso

You can also build it yourself here: https://factory.talos.dev/ -> Bare-metal Machine -> choose version -> amd64 -> choose System Extensions qemu-guest-agent

#=====================================================================================
# Providers
#=====================================================================================
terraform {
  required_version = ">= 1.6.0"
  required_providers {
    libvirt = {
      source  = "dmacvicar/libvirt"
      version = "0.7.4"
    }
    template = {
      source  = "hashicorp/template"
      version = "2.2.0"
    }
  }
}

provider "libvirt" {
  uri = "qemu:///system"
}

#=====================================================================================
# Libvirt Pool
#=====================================================================================
resource "libvirt_pool" "kubernetes" {
  name = "talos"
  type = "dir"
  path = "/opt/libvirt-pool/talos"
}

#=====================================================================================
# Network
#=====================================================================================
# resource "libvirt_network" "talos" {
#   name      = "talos"
#   mode      = "bridge"
#   bridge    = "br0" # Use the created bridge network card
#   autostart = true
# }
resource "libvirt_network" "talos" {
  name      = "talos"
  mode      = "nat"
  addresses = ["192.168.123.0/24"]
  autostart = true
}
#=====================================================================================
# Domain
#=====================================================================================
resource "libvirt_domain" "domain-talos" {
  name   = "talos"
  memory = "2048"
  vcpu   = 4
  cpu {
    mode = "host-passthrough"
  }

  qemu_agent = true

  boot_device {
    dev = ["cdrom", "hd", "network"]
  }
  network_interface {
    network_id     = libvirt_network.talos.id
    wait_for_lease = true
  }

  # cdrom
  disk {
    file = "/home/admin/Downloads/images/metal-amd64.iso"
  }
  #=====================================================================================
  # Console
  #=====================================================================================
  console {
    type        = "pty"
    target_port = "0"
    target_type = "serial"
  }

  console {
    type        = "pty"
    target_type = "virtio"
    target_port = "1"
  }

  graphics {
    type        = "spice"
    listen_type = "address"
    autoport    = true
  }
  video {
    type = "virtio"
  }
}

# Output the IP addresses
output "ips" {
  value = {
    ip = libvirt_domain.domain-talos.network_interface[0].addresses
  }
}

Reproduction steps

# set version = "0.7.1"
terraform init
terraform apply -auto-approve
terraform destroy -auto-approve
# this works
# set version = "0.7.4"
terraform init -upgrade
terraform apply -auto-approve
# this fails
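
The manual steps above can be scripted so that each provider version is tested against an otherwise identical configuration. This is a sketch under assumptions: the configuration lives in main.tf (a hypothetical file name), the version line directly follows the source = "dmacvicar/libvirt" line as in the configuration shown earlier, and set_provider_version and run_repro are illustrative helper names.

```shell
# set_provider_version FILE VERSION
# Rewrites only the version line that directly follows the libvirt provider source line.
set_provider_version() {
  spv_file=$1
  spv_version=$2
  spv_tmp=$(mktemp) || return 1
  sed '/source *= *"dmacvicar\/libvirt"/{
n
s/version *= *"[^"]*"/version = "'"$spv_version"'"/
}' "$spv_file" > "$spv_tmp" || return 1
  # Copy back instead of mv so the original file permissions are kept.
  cat "$spv_tmp" > "$spv_file" && rm -f "$spv_tmp"
}

# run_repro VERSION [FILE]
# Pins the provider version, applies with debug logging, records the result, then destroys.
run_repro() {
  rr_version=$1
  rr_file=${2:-main.tf}   # assumed file name
  set_provider_version "$rr_file" "$rr_version" || return 1
  terraform init -upgrade >/dev/null || return 1
  if TF_LOG=DEBUG terraform apply -auto-approve > "apply-$rr_version.log" 2>&1; then
    rr_result=ok
  else
    rr_result=failed
  fi
  terraform destroy -auto-approve >/dev/null 2>&1
  echo "$rr_version: apply $rr_result"
}

# Example: for v in 0.7.1 0.7.4 0.7.6 0.8.1; do run_repro "$v"; done
```

Each run leaves its debug output in apply-VERSION.log, which makes it easier to compare the point at which the wait loop stops in each version.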

Debug output with 0.7.6 (TF_LOG=DEBUG terraform apply -auto-approve):

          </graphics>
          <rng model="virtio">
              <backend model="random">/dev/urandom</backend>
          </rng>
      </devices>
  </domain>: timestamp="2024-08-14T23:28:10.086+0800"
2024-08-14T23:28:10.435+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.6: 2024/08/14 23:28:10 [INFO] Domain ID: 1e643687-5914-469d-b5c8-356c5dc65790: timestamp="2024-08-14T23:28:10.435+0800"
2024-08-14T23:28:10.435+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.6: 2024/08/14 23:28:10 [DEBUG] Waiting for state to become: [all-addresses-obtained]: timestamp="2024-08-14T23:28:10.435+0800"
2024-08-14T23:28:15.441+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.6: 2024/08/14 23:28:15 [DEBUG] waiting for network address for iface=52:54:00:16:93:28: timestamp="2024-08-14T23:28:15.440+0800"
2024-08-14T23:28:15.441+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.6: 2024/08/14 23:28:15 [DEBUG] qemu-agent used to query interface info: timestamp="2024-08-14T23:28:15.441+0800"
2024-08-14T23:28:15.443+0800 [ERROR] provider.terraform-provider-libvirt_v0.7.6: Response contains error diagnostic: diagnostic_severity=ERROR tf_proto_version=5.3 tf_provider_addr=provider @caller=github.com/hashicorp/terraform-plugin-go@v0.14.2/tfprotov5/internal/diag/diagnostics.go:55 @module=sdk.proto tf_req_id=3443beee-8402-aa9f-8e77-364a3bd03a5e tf_resource_type=libvirt_domain tf_rpc=ApplyResourceChange diagnostic_detail=""
  diagnostic_summary=
  | couldn't retrieve IP address of domain id: 1e643687-5914-469d-b5c8-356c5dc65790. Please check following: 
  | 1) is the domain running properly? 
  | 2) has the network interface an IP address? 
  | 3) Networking issues on your libvirt setup? 
  |  4) is DHCP enabled on this Domain's network? 
  | 5) if you use bridge network, the domain should have the pkg qemu-agent installed 
  | IMPORTANT: This error is not a terraform libvirt-provider error, but an error caused by your KVM/libvirt infrastructure configuration/setup 

Steps that work with 0.7.1:

  1. Change only the provider version:
    libvirt = {
      source  = "dmacvicar/libvirt"
      version = "0.7.1"
    }
  2. terraform init -upgrade
  3. TF_LOG=DEBUG terraform apply -auto-approve

Debug output with 0.7.1:

2024-08-14T23:26:07.000+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] waiting for network address for iface=52:54:00:7E:A5:63: timestamp="2024-08-14T23:26:07.000+0800"
2024-08-14T23:26:07.000+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] qemu-agent used to query interface info: timestamp="2024-08-14T23:26:07.000+0800"
2024-08-14T23:26:07.001+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] Interfaces info obtained with libvirt API:
([]libvirt.DomainInterface) <nil>: timestamp="2024-08-14T23:26:07.001+0800"
2024-08-14T23:26:07.001+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] ifaces with addresses: []: timestamp="2024-08-14T23:26:07.001+0800"
2024-08-14T23:26:07.001+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] 52:54:00:7E:A5:63 doesn't have IP address(es) yet...: timestamp="2024-08-14T23:26:07.001+0800"
2024-08-14T23:26:07.001+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [DEBUG] IP address not found for iface=52:54:00:7E:A5:63: will try in a while: timestamp="2024-08-14T23:26:07.001+0800"
2024-08-14T23:26:07.001+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:07 [TRACE] Waiting 10s before next try: timestamp="2024-08-14T23:26:07.001+0800"
libvirt_domain.domain-ubuntu: Still creating... [40s elapsed]
2024-08-14T23:26:17.010+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:17 [DEBUG] waiting for network address for iface=52:54:00:7E:A5:63: timestamp="2024-08-14T23:26:17.010+0800"
2024-08-14T23:26:17.010+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:17 [DEBUG] qemu-agent used to query interface info: timestamp="2024-08-14T23:26:17.010+0800"
2024-08-14T23:26:17.013+0800 [INFO]  provider.terraform-provider-libvirt_v0.7.1: 2024/08/14 23:26:17 [DEBUG] Interfaces info obtained with libvirt API:
([]libvirt.DomainInterface) (len=2 cap=2) {


Related issues: #1050, #1028, #1037


Additional information:

The environment is the same; only the provider version differs.

scabala commented 2 months ago

Hello, could you try specifying wait_for_lease with an image that already has qemu-guest-agent installed? I successfully got the IP address from the VM when doing so.

SJFCS commented 2 months ago

> Hello, could you try specifying wait_for_lease with an image that already has qemu-guest-agent installed? I successfully got the IP address from the VM when doing so.

Thank you for the suggestion. I haven't tried an image with qemu-guest-agent preinstalled, because I want qemu-guest-agent to be installed automatically during the cloud-init phase. That was possible in previous versions but does not work in the new versions.

scabala commented 2 months ago

I'll try to take a look and see if I can find anything changed that might cause it between those two versions.

scabala commented 2 months ago

I couldn't find anything particular between those versions. Also, I don't have bridged network in my setup and it's hard for me to create it so I used NAT-ed one and I couldn't reproduce it.

@SJFCS could you check if you can reproduce it in different network types? NAT-ed and routed for example?

EDIT: forget what I wrote, I can reproduce it, just used wrong image before :facepalm:

I'll try to bisect and see where the problem lies.

scabala commented 2 months ago

Okay, more debugging later: I cannot reproduce it - previously I had problems with cloud-init. I think it might be related to cloud-init itself rather than to provider.

Either way, I see consistent behavior between 0.7.6 and 0.7.1: it either fails if qemu-guest-agent is not installed and started, or it runs fine otherwise.
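
To see which of those two cases applies on a given host, one option is to ask libvirt for the domain's addresses from each source separately. The sketch below assumes virsh is on the PATH and a domain named talos; report_ip_sources is a hypothetical helper name, and matching on ipv4/ipv6 in the virsh domifaddr output is a simplification.

```shell
# report_ip_sources DOMAIN
# Hypothetical helper: queries each address source that virsh domifaddr supports
# and reports whether any address came back from it.
report_ip_sources() {
  ris_domain=$1
  for ris_source in lease agent arp; do
    if virsh domifaddr "$ris_domain" --source "$ris_source" 2>/dev/null | grep -Eq 'ipv4|ipv6'; then
      echo "$ris_source: address found"
    else
      echo "$ris_source: no address"
    fi
  done
}

# Example: report_ip_sources talos
```

If the lease source reports an address while the agent source does not, the guest probably obtained a DHCP lease and only the agent query path is failing.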

SJFCS commented 2 months ago

> I couldn't find anything particular between those versions. Also, I don't have bridged network in my setup and it's hard for me to create it so I used NAT-ed one and I couldn't reproduce it.
>
> @SJFCS could you check if you can reproduce it in different network types? NAT-ed and routed for example?
>
> EDIT: forget what I wrote, I can reproduce it, just used wrong image before 🤦
>
> I'll try to bisect and see where problem lies

The network configuration is the same in both cases, so I don't think it has anything to do with this.

SJFCS commented 2 months ago

> Okay, more debugging later: I cannot reproduce it - previously I had problems with cloud-init. I think it might be related to cloud-init itself rather than to provider.
>
> Either way, I have consistent behavior between 0.7.6 and 0.7.1 - it's either failing if qemu-guest-agent is not installed and started or it is running fine otherwise.

Okay, thanks for the troubleshooting, but I changed only the provider version number and kept the configuration unchanged.

scabala commented 2 months ago

Do you have cloud-init logs for both scenarios?

SJFCS commented 1 month ago

> Do you have cloud-init logs for both scenarios?

I have checked the logs in both cases; they look normal and report no errors.

SJFCS commented 3 weeks ago

This issue can be reproduced in versions greater than 0.7.1

NamelessOne91 commented 1 week ago

I wanted to have a look at this issue, but it seems I can reproduce it only with version 0.7.4

Versions 0.7.1, 0.7.6 and 0.8.1 are working fine for me. I pretty much copy-pasted your tf file in the previous comment, minus the template provider.

SJFCS commented 20 hours ago

> I wanted to have a look at this issue, but it seems I can reproduce it only with version 0.7.4
>
> Versions 0.7.1, 0.7.6 and 0.8.1 are working fine for me. I pretty much copy-pasted your tf file in the previous comment, minus the template provider.

I tried it; 0.7.6 and 0.8.1 do not work for me either. ...