bpg / terraform-provider-proxmox

Terraform Provider for Proxmox
https://registry.terraform.io/providers/bpg/proxmox

Proxmox v7.4: Sporadic failure `unable to read tail (got 0 bytes)` on teardown of VMs #1352

Open · AlexFernandes-MOVAI opened this issue 4 months ago

AlexFernandes-MOVAI commented 4 months ago

This issue is just an open question about what can cause this error, and whether it is fixed in a later version of the bpg provider or Proxmox.

Bug Description

On a production Proxmox server running version 7.4, we randomly run into a teardown error of the form `imgdel:local:ci@pam: unable to read tail (got 0 bytes)`.

To Reproduce

Steps to reproduce the behavior (a minimal command sequence is sketched below):

  1. Create multiple resources of type proxmox_virtual_environment_vm with a single terraform apply
  2. Run the VMs for some time (1-20 min)
  3. Destroy the resources with terraform destroy
  4. Teardown fails with the error mentioned above
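
The cycle boils down to the commands below (the sleep duration and the -auto-approve flags are illustrative assumptions, not part of the original report):

terraform apply -auto-approve      # create several proxmox_virtual_environment_vm resources at once
sleep 600                          # let the VMs run for a while (anywhere from 1 to 20 minutes)
terraform destroy -auto-approve    # teardown sporadically fails with "unable to read tail (got 0 bytes)"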

The Terraform configuration looks like the one below (the referenced variables are defined elsewhere):


resource "proxmox_virtual_environment_vm" "fleet_manager" {
  name            = var.fleet_manager_name
  description     = "Managed by Terraform"
  tags            = var.tags
  node_name       = var.proxmox_host_list[0]
  pool_id         = var.pool
  scsi_hardware   = var.scsihw
  stop_on_destroy = true
  started         = true
  on_boot         = false

  cpu {
    cores = var.fleet_manager_cores
    type  = var.vm_core_type
  }

  memory {
    dedicated = var.fleet_manager_memory
    floating  = var.fleet_manager_balloon
  }

  agent {
    enabled = true
  }

  machine = var.vm_type
  bios    = var.bios

  network_device {
    bridge = var.vm_network_bridge
  }

  disk {
    datastore_id = var.vm_storage
    file_id      = var.fleet_manager_img_id
    interface    = var.vm_disk_interface
    size         = var.fleet_manager_disk_size
    iothread     = true
  }

  serial_device {}
  vga {
    enabled = true
  }

  dynamic "hostpci" {
    for_each = var.fleet_manager_enable_hostpci ? [1] : []
    content {
      device = var.fleet_manager_enable_hostpci ? var.hostpci_device : null
      id     = var.fleet_manager_enable_hostpci ? var.hostpci_device_id : null
      pcie   = var.fleet_manager_enable_hostpci ? var.hostpci_device_pcie : null
      xvga   = var.fleet_manager_enable_hostpci ? var.hostpci_device_xvga : null
    }
  }

  operating_system {
    type = var.vm_os_type
  }

  initialization {
    datastore_id      = var.cloud_init_storage
    user_data_file_id = proxmox_virtual_environment_file.cloud_config_main.id

    ip_config {
      ipv4 {
        address = var.ip_list[0]
        gateway = var.ip_list[0] != "dhcp" ? var.static_ip_gateway : null
      }
    }
  }
  provisioner "local-exec" {
    when    = create
    command = "sleep ${var.startup_wait_for_ip}"
  }
}

Expected behavior

The terraform destroy should always succeed without failures.

Logs

May 02 09:28:45 hel pvedaemon[1838702]: <ci@pam> starting task UPID:hel:001C3146:032B2106:66335CCD:qmdestroy:109:ci@pam:
May 02 09:28:45 hel pvedaemon[1847622]: destroy VM 109: UPID:hel:001C3146:032B2106:66335CCD:qmdestroy:109:ci@pam:
May 02 09:28:45 hel pvedaemon[1841203]: <ci@pam> starting task UPID:hel:001C3147:032B2106:66335CCD:qmdestroy:111:ci@pam:
May 02 09:28:45 hel pvedaemon[1847623]: destroy VM 111: UPID:hel:001C3147:032B2106:66335CCD:qmdestroy:111:ci@pam:
May 02 09:28:45 hel pvedaemon[1838829]: <ci@pam> starting task UPID:hel:001C314A:032B2107:66335CCD:qmdestroy:107:ci@pam:
May 02 09:28:45 hel pvedaemon[1847626]: destroy VM 107: UPID:hel:001C314A:032B2107:66335CCD:qmdestroy:107:ci@pam:
May 02 09:28:45 hel pvedaemon[1838702]: <ci@pam> end task UPID:hel:001C3146:032B2106:66335CCD:qmdestroy:109:ci@pam: OK
May 02 09:28:45 hel pvedaemon[1841203]: <ci@pam> end task UPID:hel:001C3147:032B2106:66335CCD:qmdestroy:111:ci@pam: OK
May 02 09:28:46 hel pvedaemon[1838829]: <ci@pam> end task UPID:hel:001C314A:032B2107:66335CCD:qmdestroy:107:ci@pam: OK
May 02 09:28:47 hel pvedaemon[1841203]: <ci@pam> starting task UPID:hel:001C3158:032B21D1:66335CCF:imgdel:local:ci@pam:
May 02 09:28:47 hel pvedaemon[1838702]: <ci@pam> starting task UPID:hel:001C3159:032B21D1:66335CCF:imgdel:local:ci@pam:
May 02 09:28:47 hel pvedaemon[1838702]: <ci@pam> end task UPID:hel:001C3159:032B21D1:66335CCF:imgdel:local:ci@pam: OK
May 02 09:28:47 hel pvedaemon[1841203]: <ci@pam> end task UPID:hel:001C3158:032B21D1:66335CCF:imgdel:local:ci@pam: OK
May 02 09:28:47 hel pvedaemon[1841203]: <ci@pam> end task UPID:hel:001C315C:032B21D2:66335CCF:imgdel:local:ci@pam: unable to read tail (got 0 bytes)
bpg commented 4 months ago

Hey @AlexFernandes-MOVAI 👋🏼

Honestly, I don't have many ideas about what is causing this. It looks like you're simultaneously deleting at least 3 VMs, so there could be a race condition in PVE. Or perhaps an I/O bottleneck on your storage, causing the task inside PVE to time out.

The provider does not do much with regard to resource destruction; it just submits a task and waits for its completion.

As an experiment, you could try a different parallelism value and see if reducing it helps.
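
For example, forcing sequential destroys with the standard Terraform CLI flag (the value of 1 is just the most conservative setting; the default is 10):

terraform destroy -parallelism=1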

AlexFernandes-MOVAI commented 4 months ago

Thanks for the quick feedback @bpg, and for the hard work on this repo, which has been helping us a lot for a few months now.

I don't believe the storage can be the issue, since it is an internal SSD drive. I will try to play with parallelism and post the conclusions here.

bpg commented 4 months ago

Hm... In fairness, if the VM deletion completes without other errors, we could probably just ignore this status and assume the task has successfully finished.
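
One way to sanity-check that on the PVE side is to query the recorded status of the failing imgdel task directly with pvesh; the node name and UPID below are copied from the log excerpt above, so adjust them for your environment:

pvesh get /nodes/hel/tasks/UPID:hel:001C315C:032B21D2:66335CCF:imgdel:local:ci@pam:/status

If the exitstatus reported there is OK, the failure would only be in reading the task log tail, not in the deletion itself.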