bpg / terraform-provider-proxmox

Terraform Provider for Proxmox
https://registry.terraform.io/providers/bpg/proxmox
Mozilla Public License 2.0

`HTTP 500` on some resources when reinstalling `Proxmox` #1152

Open aleprovencio opened 3 months ago

aleprovencio commented 3 months ago

Describe the bug I have several resources created by this provider on a node. After reinstalling Proxmox on it and trying to get it back to the previous state by applying Terraform, most resources work fine, but I have found a few problems.

I'm unsure whether I should have modified Terraform's state beforehand (I did not), but Terraform recreates all resources (VMs, containers, etc.) and returns HTTP 500 errors on the following:

To Reproduce Steps to reproduce the behavior:

  1. Create any of the resources described above
  2. Run terraform apply
  3. Reinstall Proxmox
  4. Run terraform apply
  5. See errors

Please also provide a minimal Terraform configuration that reproduces the issue.


resource "proxmox_virtual_environment_group" "admin" {
  group_id = "admin"
  comment  = "Managed by Terraform"
  acl {
    path      = "/"
    propagate = true
    role_id   = "Administrator"
  }
}

resource "proxmox_virtual_environment_user" "tf-packer" {
  acl {
    path    = "/"
    role_id = proxmox_virtual_environment_role.tf-packer.role_id
  }
  comment = "Managed by Terraform"
  user_id = "tf-packer@pve"
}

resource "proxmox_virtual_environment_user" "prometheus" {
  acl {
    path    = "/"
    role_id = "PVEAuditor"
  }
  comment = "Managed by Terraform"
  user_id = "prometheus@pve"
}

resource "proxmox_virtual_environment_role" "tf-packer" {
  role_id = "tf-packer"
  privileges = [
    "VM.Allocate",
    "VM.Clone",
    "VM.Config.CDROM",
    "VM.Config.CPU",
    "VM.Config.Cloudinit",
    "VM.Config.Disk",
    "VM.Config.HWType",
    "VM.Config.Memory",
    "VM.Config.Network",
    "VM.Config.Options",
    "VM.Console",
    "VM.Monitor",
    "VM.Audit",
    "VM.PowerMgmt",
    "Datastore.AllocateSpace",
    "Datastore.Audit",
    "Pool.Allocate",
    "Sys.Audit",
    "Sys.Console",
    "Sys.Modify",
    "SDN.Use",
    "VM.Migrate",
  ]
}

resource "proxmox_virtual_environment_cluster_firewall_security_group" "ping_ssh" {
  name    = "ssh-ping"
  comment = "SSH and ping"

  rule {
    comment = "Ping"
    type    = "in"
    action  = "ACCEPT"
    proto   = "icmp"
  }

  rule {
    comment = "SSH"
    type    = "in"
    action  = "ACCEPT"
    macro   = "SSH"
  }

}

resource "proxmox_virtual_environment_cluster_firewall_security_group" "promtail_node_exp" {
  name    = "promtail-node-exp"
  comment = "Promtail and node exporter"

  rule {
    type    = "in"
    action  = "ACCEPT"
    comment = "Promtail"
    proto   = "tcp"
    dport   = "9080"
  }

  rule {
    type    = "in"
    action  = "ACCEPT"
    comment = "Prometheus node exporter"
    proto   = "tcp"
    dport   = "9100"
  }

}

Expected behavior Resources are recreated like the other ones


bpg commented 3 months ago

Hi @aleprovencio! 👋🏼

Were there any error messages reported by PVE? These issues are quite hard to debug without reproducing them, which means quite a bit of effort reinstalling PVE. So any additional details are really appreciated.

In general, if you reset the remote state of a resource (i.e. delete the resource outside of Terraform), the resource should also be removed from the local TF state, so there is no inconsistency or "state drift" for the provider to reconcile.

aleprovencio commented 3 months ago

Hello @bpg, thanks for the reply and of course, also for this awesome project.

It does make sense to me that I should probably remove all Proxmox-related resources from the Terraform state prior to reinstalling, in order to prevent the so-called "state drift".

However, I would still like to understand why resources like proxmox_virtual_environment_vm or proxmox_virtual_environment_container get recreated while the resources mentioned in this issue do not, in a test like mine with no manual intervention on the Terraform state.

Regarding errors, I don't see anything special on PVE's side. On Terraform's side, besides the HTTP 500, it just says that those resources do not exist. I wish I could give you additional details on the problem, but that's all I have for now. Maybe I could debug this better with your help.

bpg commented 3 months ago

It looks like the affected resources are "compound resources", i.e. they have references to other, separate Proxmox entities that live on different API paths. When the provider applies a change, it first has to read the resource state from the remote to detect the "drift". I think there are logical or implementation bugs in those resources: they are probably trying to read the dependent objects first (like ACLs for a user, or rules for a security group), using the "parent" object ID as the request criterion. Those parents do not exist, so the requests fail.

That's my hypothesis, without any actual debugging. There is definitely something in the provider's implementation that can be improved in this regard, though a proper investigation is needed to make a fix.

aleprovencio commented 3 months ago

Yeah I guess you are on the right path.

I've done a new test where I removed those resources from state and reinstalled Proxmox. Although terraform apply seemed to work flawlessly the first time, running the same command again still suggested changes on the resources we are talking about.