bpg / terraform-provider-proxmox

Terraform / OpenTofu Provider for Proxmox VE
https://registry.terraform.io/providers/bpg/proxmox

Proxmox v7.4: Sporadic failure `unable to read tail (got 0 bytes)` on teardown of VMs #1352

Open AlexFernandes-MOVAI opened 6 months ago

AlexFernandes-MOVAI commented 6 months ago

This issue is just an open question about what can cause this error, and whether it is fixed in a later version of the bpg provider or of Proxmox.

Bug Description

On a production Proxmox server running version 7.4, we sporadically run into a teardown error of the form `imgdel:local:ci@pam: unable to read tail (got 0 bytes)`.

To Reproduce

Steps to reproduce the behavior:

  1. Create multiple resources of type proxmox_virtual_environment_vm with a single terraform apply
  2. Run the VMs for some time (1-20 min)
  3. Destroy the resources with terraform destroy
  4. Teardown fails with the error mentioned above (see the command sketch below)
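
Roughly, the cycle from the CLI looks like the sketch below (the `sleep` duration is an arbitrary stand-in for the 1-20 min of VM runtime; `-auto-approve` is the standard Terraform flag that skips the interactive confirmation):

terraform apply -auto-approve
sleep 600
terraform destroy -auto-approve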

The Terraform configuration looks like the one below, where most values come from variables:


resource "proxmox_virtual_environment_vm" "fleet_manager" {
  name            = var.fleet_manager_name
  description     = "Managed by Terraform"
  tags            = var.tags
  node_name       = var.proxmox_host_list[0]
  pool_id         = var.pool
  scsi_hardware   = var.scsihw
  stop_on_destroy = true
  started         = true
  on_boot         = false

  cpu {
    cores = var.fleet_manager_cores
    type  = var.vm_core_type
  }

  memory {
    dedicated = var.fleet_manager_memory
    floating  = var.fleet_manager_balloon
  }

  agent {
    enabled = true
  }

  machine = var.vm_type
  bios    = var.bios

  network_device {
    bridge = var.vm_network_bridge
  }

  disk {
    datastore_id = var.vm_storage
    file_id      = var.fleet_manager_img_id
    interface    = var.vm_disk_interface
    size         = var.fleet_manager_disk_size
    iothread     = true
  }

  serial_device {}
  vga {
    enabled = true
  }

  dynamic "hostpci" {
    for_each = var.fleet_manager_enable_hostpci ? [1] : []
    content {
      device = var.fleet_manager_enable_hostpci ? var.hostpci_device : null
      id     = var.fleet_manager_enable_hostpci ? var.hostpci_device_id : null
      pcie   = var.fleet_manager_enable_hostpci ? var.hostpci_device_pcie : null
      xvga   = var.fleet_manager_enable_hostpci ? var.hostpci_device_xvga : null
    }
  }

  operating_system {
    type = var.vm_os_type
  }

  initialization {
    datastore_id      = var.cloud_init_storage
    user_data_file_id = proxmox_virtual_environment_file.cloud_config_main.id

    ip_config {
      ipv4 {
        address = var.ip_list[0]
        gateway = var.ip_list[0] != "dhcp" ? var.static_ip_gateway : null
      }
    }
  }
  provisioner "local-exec" {
    when    = create
    command = "sleep ${var.startup_wait_for_ip}"
  }
}

Expected behavior

The `terraform destroy` should always succeed without failures.

Logs

May 02 09:28:45 hel pvedaemon[1838702]: <ci@pam> starting task UPID:hel:001C3146:032B2106:66335CCD:qmdestroy:109:ci@pam:
May 02 09:28:45 hel pvedaemon[1847622]: destroy VM 109: UPID:hel:001C3146:032B2106:66335CCD:qmdestroy:109:ci@pam:
May 02 09:28:45 hel pvedaemon[1841203]: <ci@pam> starting task UPID:hel:001C3147:032B2106:66335CCD:qmdestroy:111:ci@pam:
May 02 09:28:45 hel pvedaemon[1847623]: destroy VM 111: UPID:hel:001C3147:032B2106:66335CCD:qmdestroy:111:ci@pam:
May 02 09:28:45 hel pvedaemon[1838829]: <ci@pam> starting task UPID:hel:001C314A:032B2107:66335CCD:qmdestroy:107:ci@pam:
May 02 09:28:45 hel pvedaemon[1847626]: destroy VM 107: UPID:hel:001C314A:032B2107:66335CCD:qmdestroy:107:ci@pam:
May 02 09:28:45 hel pvedaemon[1838702]: <ci@pam> end task UPID:hel:001C3146:032B2106:66335CCD:qmdestroy:109:ci@pam: OK
May 02 09:28:45 hel pvedaemon[1841203]: <ci@pam> end task UPID:hel:001C3147:032B2106:66335CCD:qmdestroy:111:ci@pam: OK
May 02 09:28:46 hel pvedaemon[1838829]: <ci@pam> end task UPID:hel:001C314A:032B2107:66335CCD:qmdestroy:107:ci@pam: OK
May 02 09:28:47 hel pvedaemon[1841203]: <ci@pam> starting task UPID:hel:001C3158:032B21D1:66335CCF:imgdel:local:ci@pam:
May 02 09:28:47 hel pvedaemon[1838702]: <ci@pam> starting task UPID:hel:001C3159:032B21D1:66335CCF:imgdel:local:ci@pam:
May 02 09:28:47 hel pvedaemon[1838702]: <ci@pam> end task UPID:hel:001C3159:032B21D1:66335CCF:imgdel:local:ci@pam: OK
May 02 09:28:47 hel pvedaemon[1841203]: <ci@pam> end task UPID:hel:001C3158:032B21D1:66335CCF:imgdel:local:ci@pam: OK
May 02 09:28:47 hel pvedaemon[1841203]: <ci@pam> end task UPID:hel:001C315C:032B21D2:66335CCF:imgdel:local:ci@pam: unable to read tail (got 0 bytes)
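
For completeness, the failing imgdel task can also be inspected directly on the node; a sketch, assuming the standard `pvenode` CLI and using the UPID from the last log line above:

pvenode task status 'UPID:hel:001C315C:032B21D2:66335CCF:imgdel:local:ci@pam:'
pvenode task log 'UPID:hel:001C315C:032B21D2:66335CCF:imgdel:local:ci@pam:'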
bpg commented 6 months ago

Hey @AlexFernandes-MOVAI 👋🏼

Honestly, I don't have many ideas about what is causing this. It looks like you're deleting at least 3 VMs simultaneously, so there could be a race condition in PVE. Or perhaps an I/O bottleneck on your storage, causing the task inside PVE to time out.

The provider does not do much with regard to resource destruction; it just submits a task and waits for its completion.

As an experiment, you could try a different parallelism value and see if reducing it helps.
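
For example, something like this (the `-parallelism` flag is a standard Terraform CLI option, default 10; a value of 1 destroys the VMs strictly one at a time):

terraform destroy -parallelism=1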

AlexFernandes-MOVAI commented 6 months ago

Thanks for the quick feedback, @bpg, and for the hard work on this repo, which has been helping us a lot for a few months now.

I don't believe storage is the issue, since it is an internal SSD drive. I will experiment with parallelism and post the conclusions here.

bpg commented 5 months ago

Hm... In fairness, if the VM deletion completes without other errors, we could probably just ignore this status and assume the task has successfully finished.