bpg / terraform-provider-proxmox

Terraform Provider for Proxmox
https://registry.terraform.io/providers/bpg/proxmox
Mozilla Public License 2.0

Proxmox VM Creation Fails with ‘Unable to Retrieve VM Identifier’ Error When Cloning Multiple VMs Simultaneously #1610

Closed - arsensimonyanpicsart closed this issue 5 days ago

arsensimonyanpicsart commented 3 weeks ago

Describe the bug When creating two or more virtual machines (VMs) simultaneously using the proxmox_virtual_environment_vm resource, an error occurs indicating a failure to retrieve the next available VM identifier.

To Reproduce Steps to reproduce the behavior:

  1. Create two Proxmox VM resources with the following Terraform configuration.
  2. Run terraform apply (or tofu apply).
  3. Observe that vm2 is created successfully, but vm1 fails with an error related to VM identifier retrieval.

Minimal Terraform configuration that reproduces the issue:


terraform {
  required_providers {
    proxmox = {
      source  = "bpg/proxmox"
      version = "0.66.3"
    }
  }
}
provider "proxmox" {
  endpoint  = "https://xxx:8006/"
  api_token = "xxx"
  insecure  = true
}

resource "proxmox_virtual_environment_vm" "vm1" {
  name            = "va-vm-name1"
  node_name       = "va-dev-proxmox01"
  stop_on_destroy = true

  clone {
    vm_id     = 9001
    full      = true
    node_name = "va-dev-proxmox02"
  }
}

resource "proxmox_virtual_environment_vm" "vm2" {
  name            = "va-vm-name2"
  node_name       = "va-dev-proxmox01"
  stop_on_destroy = true

  clone {
    vm_id     = 9001
    full      = true
    node_name = "va-dev-proxmox02"
  }
}

Output of terraform|tofu apply:

proxmox_virtual_environment_vm.vm2: Creating...
proxmox_virtual_environment_vm.vm2: Still creating... [10s elapsed]
proxmox_virtual_environment_vm.vm2: Creation complete after 18s [id=115]
╷
│ Error: unable to retrieve the next available VM identifier: context deadline exceeded
│
│   with proxmox_virtual_environment_vm.vm1,
│   on main.tf line 16, in resource "proxmox_virtual_environment_vm" "vm1":
│   16: resource "proxmox_virtual_environment_vm" "vm1" {

Expected behavior Both vm1 and vm2 should be created successfully.


ratiborusx commented 3 weeks ago

Getting the same error while trying to create 9 VMs in a GitLab pipeline. It creates 4 and fails the others with the same error. I updated the provider to the current version (0.66.3), but I can't say whether the new version introduced the problem, because previously my pipeline was creating only 4 VMs in total (not 9 like now). I tried to mitigate with 'parallelism=3' but it didn't work - after creating 3+1 in two batches it failed. I believe @bpg already tried to address the next-VMID allocation issue in a few previous commits. I'll try to downgrade the provider version and see if it helps.

PVE 8.2.2 Terraform 1.9.3 bpg/proxmox 0.66.3

ratiborusx commented 3 weeks ago

I believe that's the PR I mentioned with the VMID allocation rework: https://github.com/bpg/terraform-provider-proxmox/pull/1557. Maybe we could try the new 'random_vm_ids' feature, I'll check it out for sure. Still, it would be nice to get the standard behavior back into working order.

ratiborusx commented 3 weeks ago

Downgraded to 0.65.0 (as the new VMID allocation features were added in 0.66) and all 9 VMs were created successfully.

bpg commented 3 weeks ago

Thanks for testing @ratiborusx, we had a few other reports flagging this issue, so it's good to have confirmation. I haven't had a chance to look into it yet; I'll try getting to it this weekend 🤞

ratiborusx commented 3 weeks ago

Went back to 0.66.3 to check the random VMID feature - it works as declared, all 9 VMs were created successfully. Added these to the 'provider' block:

provider "proxmox" {
...
  random_vm_ids      = true
  random_vm_id_start = 90000
  random_vm_id_end   = 90999
...
}

So there are two ways to deal with the issue as of now; hopefully @bpg will be able to tinker with that stuff a bit more. For now I'll stay on 0.66.3 and see how the random VMID feature behaves. As I understand it, it should help prevent VMID collisions on allocation during parallel execution (for example, a few pipelines plus manual execution from a workstation running simultaneously against the same cluster), unlike the pre-0.66.0 approach.

bpg commented 3 weeks ago

I’m unable to pinpoint the issue 🤔. I can create six VMs simultaneously from the same clone without any problems. However, I noticed that the OP is cloning to a different node than the source, which I can’t test at the moment. @ratiborusx, is your use case similar, cloning between nodes?

ratiborusx commented 3 weeks ago

@bpg Oh boy, time for some late-night testing. I'm pretty sure that is not the case - because we don't have any NAS/NFS yet, I created a bootstrapping module to prepare the cluster for usage, where every "basic" resource (cloud configs, images and templates) is duplicated on each node. Something like this:

$ terraform state list
module.proxmox_common.proxmox_virtual_environment_download_file.px_cloud_image["almalinux-9@prox-srv1"]
module.proxmox_common.proxmox_virtual_environment_download_file.px_cloud_image["almalinux-9@prox-srv2"]
module.proxmox_common.proxmox_virtual_environment_download_file.px_cloud_image["almalinux-9@prox-srv3"]
...
module.proxmox_common.proxmox_virtual_environment_vm.px_template["debian-12-main@prox-srv1"]
module.proxmox_common.proxmox_virtual_environment_vm.px_template["debian-12-main@prox-srv2"]
module.proxmox_common.proxmox_virtual_environment_vm.px_template["debian-12-main@prox-srv3"]
...
module.proxmox_common.proxmox_virtual_environment_file.px_ci_data["userdata-proxmox-generic-automation@prox-srv1"]
module.proxmox_common.proxmox_virtual_environment_file.px_ci_data["userdata-proxmox-generic-automation@prox-srv2"]
module.proxmox_common.proxmox_virtual_environment_file.px_ci_data["userdata-proxmox-generic-automation@prox-srv3"]

So when I use another module for actual VM provisioning, it uses the specified template from the same node the VM is being created on. All of that is because initial tests showed that cloning from another node takes too long (if I remember correctly, it first creates the VM on the node the template is located on and then migrates it over the network to the specified node). I believe there were some problems with cloud-init configs too - if the same userdata is not present on the node you clone (or migrate) to, cloud-init runs again and uses the default cloud.cfg/cloud.cfg.d stuff (at least that's how I remember it from my last tests a year ago).

Also, I declare the 'clone.node_name' variable as optional and never specify it - in that case it should default to the resource's 'node_name'. BUT now I've got a question: what is the use of 'clone.node_name' (which is optional and defaults to the 'node_name' of the VM being created if empty) at all, if we also have to specify the required argument 'clone.vm_id' and VMIDs are cluster-unique? I've probably forgotten some of that stuff, but I just couldn't answer this one myself at the moment...
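(For reference, a minimal sketch of the same-node cloning pattern described above, with clone.node_name omitted so it falls back to the VM's own node_name; the template VMID 149 and node prox-srv2 come from the state output below, while the resource label and VM name are just placeholders.)

resource "proxmox_virtual_environment_vm" "px_vm_example" {
  name      = "example-clone"
  node_name = "prox-srv2"   # node the new VM is created on

  clone {
    vm_id = 149             # template already present on prox-srv2
    full  = true
    # node_name is omitted here, so it defaults to the node_name above
    # and no cross-node migration is involved
  }
}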

Here's some (slightly truncated) output; I believe the template and the clone(s) are located on the same node (prox-srv2):

Plan: 10 to add, 0 to change, 0 to destroy.
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-07"]: Creating...
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-08"]: Creating...
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-04"]: Creating...
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-01"]: Creating...
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-05"]: Creating...
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-02"]: Creating...
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-03"]: Creating...
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-09"]: Creating...
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-06"]: Creating...
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-mgt"]: Creating...
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-07"]: Still creating... [10s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-08"]: Still creating... [10s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-04"]: Still creating... [10s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-05"]: Still creating... [10s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-01"]: Still creating... [10s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-02"]: Still creating... [10s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-03"]: Still creating... [10s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-09"]: Still creating... [10s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-06"]: Still creating... [10s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-07"]: Still creating... [20s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-08"]: Still creating... [20s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-04"]: Still creating... [20s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-05"]: Still creating... [20s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-06"]: Still creating... [20s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-09"]: Still creating... [20s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-03"]: Still creating... [20s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-07"]: Still creating... [30s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-08"]: Still creating... [30s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-04"]: Still creating... [30s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-05"]: Still creating... [30s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-06"]: Still creating... [30s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-03"]: Still creating... [30s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-08"]: Creation complete after 32s [id=205]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-07"]: Still creating... [40s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-04"]: Still creating... [40s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-03"]: Still creating... [40s elapsed]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-07"]: Creation complete after 41s [id=202]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-04"]: Creation complete after 44s [id=204]
proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-03"]: Creation complete after 44s [id=203]
╷
│ Error: unable to retrieve the next available VM identifier: context deadline exceeded
│
│   with proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-mgt"],
│   on main.tf line 17, in resource "proxmox_virtual_environment_vm" "px_vm":
│   17: resource "proxmox_virtual_environment_vm" "px_vm" {
│
╵
╷
│ Error: unable to retrieve the next available VM identifier: context deadline exceeded
│
│   with proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-02"],
│   on main.tf line 17, in resource "proxmox_virtual_environment_vm" "px_vm":
│   17: resource "proxmox_virtual_environment_vm" "px_vm" {
│
╵
╷
│ Error: unable to retrieve the next available VM identifier: context deadline exceeded
│
│   with proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-06"],
│   on main.tf line 17, in resource "proxmox_virtual_environment_vm" "px_vm":
│   17: resource "proxmox_virtual_environment_vm" "px_vm" {
│
╵
╷
│ Error: unable to retrieve the next available VM identifier: context deadline exceeded
│
│   with proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-05"],
│   on main.tf line 17, in resource "proxmox_virtual_environment_vm" "px_vm":
│   17: resource "proxmox_virtual_environment_vm" "px_vm" {
│
╵
╷
│ Error: unable to retrieve the next available VM identifier: context deadline exceeded
│
│   with proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-01"],
│   on main.tf line 17, in resource "proxmox_virtual_environment_vm" "px_vm":
│   17: resource "proxmox_virtual_environment_vm" "px_vm" {
│
╵
╷
│ Error: unable to retrieve the next available VM identifier: context deadline exceeded
│
│   with proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-09"],
│   on main.tf line 17, in resource "proxmox_virtual_environment_vm" "px_vm":
│   17: resource "proxmox_virtual_environment_vm" "px_vm" {
│
╵

$ terraform state show data.proxmox_virtual_environment_vms.template_vms
# data.proxmox_virtual_environment_vms.template_vms:
data "proxmox_virtual_environment_vms" "template_vms" {
    id   = "some-id-was-here-123abc"
    tags = [
        "templates",
    ]
    vms  = [
    ...
        {
            name      = "astra-1.7.5-adv-main"
            node_name = "prox-srv2"
            status    = "stopped"
            tags      = [
                "image-astra-1.7.5-adv",
                "templates",
                "terraform",
            ]
            template  = true
            vm_id     = 149
        },
    ]
}

$ terraform state show 'proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-03"]'
# proxmox_virtual_environment_vm.px_vm["xdata-dev-stand-host-03"]:
resource "proxmox_virtual_environment_vm" "px_vm" {
...
id                      = "203"
...
node_name               = "prox-srv2"
...
vm_id                   = 203
...
clone {
        datastore_id = null
        full         = true
        node_name    = null
        retries      = 3
        vm_id        = 149
    }
...
maxexcloo commented 2 weeks ago

I'm getting this error also:

OpenTofu used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

OpenTofu will perform the following actions:

  # proxmox_virtual_environment_vm.vm["au-pie-haos"] will be created
  + resource "proxmox_virtual_environment_vm" "vm" {
      + acpi                    = true
      + bios                    = "ovmf"
      + id                      = (known after apply)
      + ipv4_addresses          = (known after apply)
      + ipv6_addresses          = (known after apply)
      + keyboard_layout         = "en-us"
      + mac_addresses           = (known after apply)
      + machine                 = "q35"
      + migrate                 = false
      + name                    = "pie-haos"
      + network_interface_names = (known after apply)
      + node_name               = "pie"
      + on_boot                 = true
      + protection              = false
      + reboot                  = false
      + scsi_hardware           = "virtio-scsi-single"
      + started                 = true
      + stop_on_destroy         = false
      + tablet_device           = true
      + template                = false
      + timeout_clone           = 1800
      + timeout_create          = 1800
      + timeout_migrate         = 1800
      + timeout_move_disk       = 1800
      + timeout_reboot          = 1800
      + timeout_shutdown_vm     = 1800
      + timeout_start_vm        = 1800
      + timeout_stop_vm         = 300
      + vm_id                   = (known after apply)

      + cpu {
          + cores      = 2
          + hotplugged = 0
          + limit      = 0
          + numa       = false
          + sockets    = 1
          + type       = "host"
          + units      = 1024
        }

      + disk {
          + aio               = "io_uring"
          + backup            = true
          + cache             = "none"
          + datastore_id      = "local-zfs"
          + discard           = "on"
          + file_format       = "raw"
          + interface         = "virtio0"
          + iothread          = true
          + path_in_datastore = (known after apply)
          + replicate         = true
          + size              = 128
          + ssd               = false
        }

      + efi_disk {
          + datastore_id      = "local-zfs"
          + file_format       = (known after apply)
          + pre_enrolled_keys = false
          + type              = "4m"
        }

      + memory {
          + dedicated      = 4096
          + floating       = 0
          + keep_hugepages = false
          + shared         = 0
        }

      + network_device {
          + bridge      = "vmbr0"
          + enabled     = true
          + firewall    = true
          + mac_address = (known after apply)
          + model       = "virtio"
          + mtu         = 0
          + queues      = 0
          + rate_limit  = 0
          + vlan_id     = 0
        }

      + operating_system {
          + type = "l26"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  OpenTofu will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

proxmox_virtual_environment_vm.vm["au-pie-haos"]: Creating...
╷
│ Error: unable to retrieve the next available VM identifier: context deadline exceeded
│ 
│   with proxmox_virtual_environment_vm.vm["au-pie-haos"],
│   on proxmox.tf line 45, in resource "proxmox_virtual_environment_vm" "vm":
│   45: resource "proxmox_virtual_environment_vm" "vm" {
│ 
╵
Releasing state lock. This may take a few moments...

Have tried random_vm_ids on and off, using provider version v0.66.3 and OpenTofu v1.8.3.

maxexcloo commented 2 weeks ago

Removing the following from the ssh section of the provider seems to have made it work:

    node {
      address = var.terraform.proxmox.pie.host
      name    = var.terraform.proxmox.pie.name
    }
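(For context, a sketch of where that fragment sits in the provider configuration, assuming the provider's ssh/node nesting and reusing the variable names from the comment above; endpoint and token are placeholders.)

provider "proxmox" {
  endpoint  = "https://xxx:8006/"
  api_token = "xxx"

  ssh {
    # removing this node block reportedly made the apply succeed
    node {
      address = var.terraform.proxmox.pie.host
      name    = var.terraform.proxmox.pie.name
    }
  }
}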
bpg commented 6 days ago

Removing the following from the ssh section of the provider seems to have made it work:

    node {
      address = var.terraform.proxmox.pie.host
      name    = var.terraform.proxmox.pie.name
    }

That doesn't seem to be related 🤔 This section configures the provider's SSH client. The "next ID" functionality uses only the PVE REST API.

maxexcloo commented 6 days ago

Removing the following from the ssh section of the provider seems to have made it work:

    node {
      address = var.terraform.proxmox.pie.host
      name    = var.terraform.proxmox.pie.name
    }

That doesn't seem to be related 🤔 This section configures the provider's SSH client. The "next ID" functionality uses only the PVE REST API.

Very strange - haven't had an issue since (although I may have missed something else!)

caendekerk commented 5 days ago

We have been fighting with this issue for some time; it appears very randomly and currently breaks our deployment. According to our tests:

Testing with the newly released version 0.67.0, we were able to get some more logs regarding this issue. What immediately caught our attention was that IDs that are already in use are requested using the API call GET /api2/json/cluster/nextid?vmid=<UNAVAILABLE_ID>. In the logs we could not find a single call to GET /api2/json/cluster/nextid without the vmid query parameter, which would have returned an available ID. As seen in the Proxmox logs, HTTP error code 400 is returned, which indicates that the requested ID is not available.
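(For illustration, the two request shapes can be compared directly against the PVE API; the host and token below are placeholders, and the 400 response matches the "VM 218 already exists" error visible in the provider logs further down.)

$ # without vmid: PVE returns the next free identifier
$ curl -k -H "Authorization: PVEAPIToken=user@pam!token=xxx" \
    "https://xxx:8006/api2/json/cluster/nextid"

$ # with an already-used vmid: PVE answers 400 ("VM 218 already exists")
$ curl -k -H "Authorization: PVEAPIToken=user@pam!token=xxx" \
    "https://xxx:8006/api2/json/cluster/nextid?vmid=218"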

Proxmox logs:

REDACTED [20/Nov/2024:11:24:50.279] pve~ pve/REDACTED 0/0/4/3/7 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=190 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:50.486] pve~ pve/REDACTED 0/0/3/3/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=191 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:50.693] pve~ pve/REDACTED 0/0/5/4/9 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=192 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:50.904] pve~ pve/REDACTED 0/0/0/3/3 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=193 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:51.109] pve~ pve/REDACTED 0/0/0/3/3 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=194 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:51.312] pve~ pve/REDACTED 0/0/3/3/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=195 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:51.519] pve~ pve/REDACTED 0/0/5/2/7 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=196 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:51.732] pve~ pve/REDACTED 0/0/9/3/12 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=197 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:51.947] pve~ pve/REDACTED 0/0/3/3/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=198 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:52.155] pve~ pve/REDACTED 0/0/3/2/5 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=199 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:52.361] pve~ pve/REDACTED 0/0/2/2/4 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=200 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:52.567] pve~ pve/REDACTED 0/0/3/2/5 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=201 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:52.774] pve~ pve/REDACTED 0/0/5/5/10 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=202 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:52.985] pve~ pve/REDACTED 0/0/4/5/9 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=203 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:53.197] pve~ pve/REDACTED 0/0/3/3/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=204 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:53.404] pve~ pve/REDACTED 0/0/3/3/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=205 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:53.611] pve~ pve/REDACTED 0/0/4/3/7 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=206 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:53.819] pve~ pve/REDACTED 0/0/3/3/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=207 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:54.028] pve~ pve/REDACTED 0/0/4/4/8 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=208 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:54.236] pve~ pve/REDACTED 0/0/3/3/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=209 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:54.443] pve~ pve/REDACTED 0/0/3/6/9 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=210 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:54.654] pve~ pve/REDACTED 0/0/3/2/5 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=211 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:54.861] pve~ pve/REDACTED 0/0/4/2/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=212 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:55.070] pve~ pve/REDACTED 0/0/3/3/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=213 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:55.276] pve~ pve/REDACTED 0/0/3/3/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=214 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:55.483] pve~ pve/REDACTED 0/0/4/2/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=215 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:55.691] pve~ pve/REDACTED 0/0/3/3/6 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=216 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:55.898] pve~ pve/REDACTED 0/0/5/3/8 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=217 HTTP/1.1"
REDACTED [20/Nov/2024:11:24:56.108] pve~ pve/REDACTED 0/0/4/3/7 400 304 - - ---- 2/2/0/0/0 0/0 "GET /api2/json/cluster/nextid?vmid=218 HTTP/1.1"

Additionally, we can provide a relevant snippet from our tofu apply logs.

tofu apply logs:

2024-11-20T11:24:31.356Z [DEBUG] provider.terraform-provider-proxmox_v0.67.0: Sending HTTP Request: @caller=/home/runner/go/pkg/mod/github.com/hashicorp/terraform-plugin-sdk/v2@v2.35.0/helper/logging/logging_http_transport.go:162 Host=REDACTED User-Agent=Go-http-client/1.1 tf_http_req_body="" tf_mux_provider=tf5to6server.v5tov6Server Accept=application/json tf_http_req_method=GET tf_provider_addr=registry.terraform.io/bpg/proxmox tf_req_id=19938013-de2b-01e4-2f70-fea77f785e61 Authorization="PVEAPIToken=REDACTED" tf_http_op_type=request tf_http_req_uri=/api2/json/cluster/nextid?vmid=218 tf_rpc=ApplyResourceChange @module=proxmox Accept-Encoding=gzip tf_http_req_version=HTTP/1.1 tf_http_trans_id=49601e70-85f3-f9b7-65b0-553a9a5f3c3f tf_resource_type=proxmox_virtual_environment_vm timestamp=2024-11-20T11:24:31.355Z

2024-11-20T11:24:31.364Z [DEBUG] provider.terraform-provider-proxmox_v0.67.0: Received HTTP Response: tf_http_res_status_code=400 tf_http_res_version=HTTP/1.1 tf_http_trans_id=49601e70-85f3-f9b7-65b0-553a9a5f3c3f tf_resource_type=proxmox_virtual_environment_vm Server=pve-api-daemon/3.0 tf_http_op_type=response tf_http_res_body="{\"errors\":{\"vmid\":\"VM 218 already exists\"},\"data\":null}" tf_req_id=19938013-de2b-01e4-2f70-fea77f785e61 @caller=/home/runner/go/pkg/mod/github.com/hashicorp/terraform-plugin-sdk/v2@v2.35.0/helper/logging/logging_http_transport.go:162 @module=proxmox Cache-Control=max-age=0 Pragma=no-cache tf_http_res_status_reason="400 Parameter verification failed." tf_mux_provider=tf5to6server.v5tov6Server tf_rpc=ApplyResourceChange Content-Length=55 tf_provider_addr=registry.terraform.io/bpg/proxmox Content-Type=application/json;charset=UTF-8 Date="Wed, 20 Nov 2024 11:24:31 GMT" Expires="Wed, 20 Nov 2024 11:24:31 GMT" timestamp=2024-11-20T11:24:31.364Z

2024-11-20T11:24:31.523Z [ERROR] provider.terraform-provider-proxmox_v0.67.0: Response contains error diagnostic: diagnostic_summary="unable to retrieve the next available VM identifier: context deadline exceeded" tf_rpc=ApplyResourceChange diagnostic_severity=ERROR tf_provider_addr=registry.terraform.io/bpg/proxmox tf_req_id=19938013-de2b-01e4-2f70-fea77f785e61 @module=sdk.proto diagnostic_detail="" @caller=/home/runner/go/pkg/mod/github.com/hashicorp/terraform-plugin-go@v0.25.0/tfprotov6/internal/diag/diagnostics.go:58 tf_proto_version=6.7 tf_resource_type=proxmox_virtual_environment_vm timestamp=2024-11-20T11:24:31.522Z

2024-11-20T11:24:31.540Z [DEBUG] State storage *remote.State declined to persist a state snapshot

2024-11-20T11:24:31.540Z [ERROR] vertex "REDACTED.proxmox_virtual_environment_vm.user_vm" error: unable to retrieve the next available VM identifier: context deadline exceeded

2024-11-20T11:24:31.556Z [WARN]  provider.terraform-provider-proxmox_v0.67.0: unable to require attribute replacement: error="ForceNew: No changes for vm_id" tf_attribute_path=vm_id tf_req_id=b25857c0-a85e-1517-1667-fe4e5bf1f4ec @caller=/home/runner/go/pkg/mod/github.com/hashicorp/terraform-plugin-sdk/v2@v2.35.0/helper/customdiff/force_new.go:32 tf_mux_provider=tf5to6server.v5tov6Server tf_provider_addr=registry.terraform.io/bpg/proxmox tf_resource_type=proxmox_virtual_environment_vm tf_rpc=PlanResourceChange @module=sdk.helper_schema timestamp=2024-11-20T11:24:31.556Z

We hope that the provided insight helps to fix the underlying problem. Should more information be required, don't hesitate to reach out to us.

bpg commented 5 days ago

Thanks @caendekerk, that helps!

caendekerk commented 5 days ago

Randomizing the VM ids as suggested above did help to fix our deployment.

bpg commented 5 days ago

Related: #1574

bpg commented 5 days ago

Ok, so here is the use case that doesn't work well.

Precondition: there are a lot of VMs / containers on PVE, but there is a gap in ID allocation close to the beginning of the range. For example: 100, 101, 102, 105, 106, <continues without gaps>, 150 (technically, the tail of continuous IDs after the gap should be at least 25).

Now imagine we need to provision 4 VMs from a single config. By default, TF parallelism is 4, so those VMs will start provisioning simultaneously in separate threads. They all ask PVE for the "next ID" at the same time, and all receive the same answer: "here is 103!"

Obviously, that won't work. The provider has a locking mechanism that prevents parallel executions from using the same ID, so a thread that runs into this condition uses simple "+1" logic to try the next one. But this doesn't work in all cases, especially when the "gap" of unused IDs is smaller than the number of VMs being provisioned concurrently at that moment. The provider then keeps incrementing the ID looking for an available one until it times out (currently 5 seconds, with a 200 ms delay between iterations).
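To put rough numbers on the example above: with 100-102 and 105-150 in use, the gap is only {103, 104}. Four concurrent creates all get 103 from PVE; two of them secure 103 and 104 through the lock-plus-"+1" logic, and the remaining two keep stepping through the already-used 105, 106, ... range. At 200 ms per attempt, the 5-second deadline allows roughly 25 attempts, which is why a continuous tail of about 25 used IDs after the gap is enough to hit the timeout.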

It seems logical to use the "get next ID" API call again instead of "+1"; however, it is not guaranteed to return an actually available ID: there could be an in-flight "create VM" call for, e.g., ID 103 that is not yet committed by PVE, so the API may happily return 103 again.

I think a few retries should solve that. I'll also add an acceptance test to verify this scenario.

bpg commented 5 days ago

Randomizing the VM ids as suggested above did help to fix our deployment.

Another workaround besides using random IDs is to set parallelism to 1 for terraform|tofu apply.
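(For reference, that limit is passed as a CLI flag at apply time.)

$ terraform apply -parallelism=1
$ # or with OpenTofu
$ tofu apply -parallelism=1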

bpg commented 5 days ago

Just pushed the v0.67.1 release, which should have it fixed.

caendekerk commented 5 days ago

Fyi: Setting parallelism to 1 did not help in our case.