canonical / terraform-provider-maas

Terraform MAAS provider

Logical Destruction from Drift Attempts to "Physically" Destroy Machine Components #203

Open sempervictus opened 5 months ago

sempervictus commented 5 months ago

Dynamic dispatch of resources such as NICs can result in those NICs being destroyed when MAAS re-commissions a machine and drifts its state such that the resource falls out of scope:

data "maas_network_interface_physical" "metal-1" {
  for_each = maas_machine.metal-1
  machine = each.value.id
  name    = "enp3s0"
}
resource "maas_network_interface_link" "metal-1" {
  for_each          = data.maas_network_interface_physical.metal-1
  machine           = each.value.machine
  network_interface = each.value.id
  subnet            = maas_subnet.prov_subnet_1.id
  mode              = "STATIC"
  ip_address        = cidrhost(maas_subnet.prov_subnet_1.cidr, 3 + index(keys(data.maas_network_interface_physical.metal-1), each.key))
  default_gateway   = true
}

but if a firmware update has just completed, or a kernel-command-line option changes that NIC name, the resource gets destroyed. This creates a death spiral because the destruction happens not just in Terraform state but in MAAS itself: the NIC is deleted from the machine, and a re-commission is required for MAAS to regain awareness of it. This seems like terminally incorrect behavior, since Terraform cannot actually delete disks or NICs from a physical server or VM handed to MAAS at this level of interface (in our case, since we are using the Terraform OpenStack provider, that provider could trim them from the underlying instance, but the MAAS provider cannot). Other resources depending on this one then get deleted and the provider loses state sync, causing the disk dispatch block below, together with the network one above:

data "maas_block_device" "metal-1" {
  for_each =  { for entry in local.disk_map_1: "${entry.machine}.${entry.disk}" => entry }
  name = each.value.disk
  machine = each.value.machine
}
resource "maas_block_device" "metal-1-os" {
  for_each       = { 
    for k,v in data.maas_block_device.metal-1: k => v if v.size_gigabytes != 128 && contains(flatten([
      for p in v.partitions: [
        p.mount_point
      ]
    ]),"/")
  }
  machine        = each.value.machine
  name           = each.value.name
  id_path        = each.value.id_path
  size_gigabytes = each.value.size_gigabytes
  is_boot_device = true
  tags = [
    "os_drive",
  ]
  partitions {
    size_gigabytes = 32
    fs_type        = "ext4"
    label          = "os"
    mount_point    = "/"
  }
  partitions {
    size_gigabytes = 1
    fs_type        = "fat32"
    mount_point    = "/boot"
    bootable       = true
  }
}

to correctly identify OS drives after a commissioning pass, but then lose track of the other disks and crash the run:

OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

OpenTofu planned the following actions, but then encountered a problem:

  # maas_block_device.metal-1-os["test-metal1-00.sdg"] will be created
  + resource "maas_block_device" "metal-1-os" {
      + block_size     = 512
      + id             = (known after apply)
      + id_path        = "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_f150670d-9df4-440e-855f-07eb00906be1"
      + is_boot_device = true
      + machine        = "fwtq6w"
      + model          = (known after apply)
      + name           = "sdg"
      + path           = (known after apply)
      + serial         = (known after apply)
      + size_gigabytes = 64
      + tags           = [
          + "os_drive",
        ]
      + uuid           = (known after apply)

      + partitions {
          + fs_type        = "ext4"
          + label          = "os"
          + mount_point    = "/"
          + path           = (known after apply)
          + size_gigabytes = 32
        }
      + partitions {
          + bootable       = true
          + fs_type        = "fat32"
          + mount_point    = "/boot"
          + path           = (known after apply)
          + size_gigabytes = 1
        }
    }

  # maas_block_device.metal-1-os["test-metal1-01.sdg"] will be updated in-place
  ~ resource "maas_block_device" "metal-1-os" {
        id             = "20"
        name           = "sdg"
      ~ tags           = [
          - "1rpm",
          - "rotary",
          + "os_drive",
        ]
        # (8 unchanged attributes hidden)

      ~ partitions {
          ~ label          = "root" -> "os"
          ~ size_gigabytes = 63 -> 32
            tags           = []
            # (4 unchanged attributes hidden)
        }
      + partitions {
          + bootable       = true
          + fs_type        = "fat32"
          + mount_point    = "/boot"
          + size_gigabytes = 1
        }
    }

  # maas_block_device.metal-1-os["test-metal1-02.sdg"] will be created
  + resource "maas_block_device" "metal-1-os" {
      + block_size     = 512
      + id             = (known after apply)
      + id_path        = "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_4a7e675b-d4e8-4f9e-9a4c-7c1b8902c3c8"
      + is_boot_device = true
      + machine        = "kherdk"
      + model          = (known after apply)
      + name           = "sdg"
      + path           = (known after apply)
      + serial         = (known after apply)
      + size_gigabytes = 64
      + tags           = [
          + "os_drive",
        ]
      + uuid           = (known after apply)

      + partitions {
          + fs_type        = "ext4"
          + label          = "os"
          + mount_point    = "/"
          + path           = (known after apply)
          + size_gigabytes = 32
        }
      + partitions {
          + bootable       = true
          + fs_type        = "fat32"
          + mount_point    = "/boot"
          + path           = (known after apply)
          + size_gigabytes = 1
        }
    }

  # maas_block_device.metal-1-os["test-metal1-03.sdg"] will be created
  + resource "maas_block_device" "metal-1-os" {
      + block_size     = 512
      + id             = (known after apply)
      + id_path        = "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_d144595b-1881-416f-8797-63f102c731e4"
      + is_boot_device = true
      + machine        = "yxsy7e"
      + model          = (known after apply)
      + name           = "sdg"
      + path           = (known after apply)
      + serial         = (known after apply)
      + size_gigabytes = 64
      + tags           = [
          + "os_drive",
        ]
      + uuid           = (known after apply)

      + partitions {
          + fs_type        = "ext4"
          + label          = "os"
          + mount_point    = "/"
          + path           = (known after apply)
          + size_gigabytes = 32
        }
      + partitions {
          + bootable       = true
          + fs_type        = "fat32"
          + mount_point    = "/boot"
          + path           = (known after apply)
          + size_gigabytes = 1
        }
    }

Plan: 3 to add, 1 to change, 0 to destroy.

....
│ Error: ServerError: 404 Not Found (No BlockDevice matches the given query.)
│ 
│   with maas_block_device.metal-1-data["test-metal1-02.sde"],
│   on maas-blockdev.tf line 54, in resource "maas_block_device" "metal-1-data":
│   54: resource "maas_block_device" "metal-1-data" {
│ 
╵
╷
│ Error: cannot find link (32) on the network interface (12) from machine (kherdk)
│ 
│   with maas_network_interface_link.metal-1["test-metal1-02"],
│   on maas-machines.tf line 16, in resource "maas_network_interface_link" "metal-1":
│   16: resource "maas_network_interface_link" "metal-1" {
│ 
╵
...
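
Until the provider guards physical resources itself, a lifecycle flag can at least turn the silent deletion into a hard plan error. A minimal sketch against the link resource above (this aborts the plan rather than fixing the drift):

resource "maas_network_interface_link" "metal-1" {
  # ... arguments as in the block above ...

  lifecycle {
    # Fail the plan instead of destroying the link (and, with the current
    # provider behavior, the NIC in MAAS) when the interface data source
    # stops returning this instance after a re-commission.
    prevent_destroy = true
  }
}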

The bottom line is that physical components - disks, NICs, BMCs - should not be destroyable through this provider. Logical constructs (partitions, VLANs, etc.) of course should be; but this logic bug breaks reference tracking and brings the whole house of cards down hard, requiring state rm surgery and other undesirable interventions to recover.
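
For recovery, the orphaned entries can be dropped from state without another destroy round-trip; in recent OpenTofu/Terraform releases (1.7+) a removed block does the equivalent of state rm declaratively. A sketch, assuming the whole resource should be forgotten rather than destroyed:

removed {
  from = maas_block_device.metal-1-os

  lifecycle {
    # Forget the state entry; do not ask MAAS to delete the disk.
    destroy = false
  }
}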

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity.