canonical / terraform-provider-maas

Terraform MAAS provider
Mozilla Public License 2.0
60 stars 43 forks source link

Updating tags gets slower and slower with more nodes. #175

Closed fivovic closed 1 week ago

fivovic commented 3 months ago

For reference, I currently manage ~230 machines defined within terraform, deployed via MaaS.

They are defined in machine/instance block loops, each taking ~6 mins to comission and ~12 mins to deploy.

They are all tagged with a default "terraform" tag to indicate their origin, along with a few other tags depending on their purpose.

When adding 1+ machines, updating this "terraform" tag takes ~20 mins!

Here is the tag resource:

resource "maas_tag" "terraform" {
    name        = "terraform"
    comment     = "automated deployment via terraform"
    machines    = split(" ",
        join(" ",
            [ for item in maas_machine.node_type_01 : item.id ],
            [ for item in maas_machine.node_type_02 : item.id ],
            [ for item in maas_machine.node_type_03 : item.id ],
            [ for item in maas_machine.node_type_04 : item.id ],
            [ for item in maas_machine.node_type_05 : item.id ]
        )
    )
}

I remember when it took a couple of mins to update a dozen nodes, several minutes to update less than a hundred nodes, and this problem is only getting worse and worse.

I'm not certain but the problem seems unrelated to the terraform logic used to build the machine list, only that a large list of nodes is supplied. All other tags seem affected by this huge performance hit dependant entirely on how many nodes are in each machine list, but it is most obvious in the tag that is applied to all nodes.

This update logic seems wildly unsustainable. I would love some pointers to understand this bottleneck because it feels like there must be thousands of API calls happening under the hood for a fairly simple update operation?

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity.

spdfnet commented 2 months ago

I feel this is a MAAS API limitation/behavior rather than the provider itself.

For reference, removing/changing tags on block devices takes a long time as well, regardless of the tool used (cli, curl, OpenTofu, etc.).

fivovic commented 1 month ago

I don't think the individual operation speed is at fault, but how many operations terraform is performing.

This manual command takes maybe 1 second to run, and is what I would expect from adding a node to an existing tag (which already has 200 machines tagged): maas admin tag update-nodes tag_name add=$node_id

But adding 1 node to the terraform indicates it will delete 200+ IDs from this tag and then I presume later calculate them all and add them, sanitised output below:

OpenTofu will perform the following actions:

  # maas_instance.group_name["host_name"] will be created
  + resource "maas_instance" "group_name" {
      + cpu_count    = (known after apply)
      + fqdn         = (known after apply)
      + hostname     = (known after apply)
      + id           = (known after apply)
      + ip_addresses = (known after apply)
      + memory       = (known after apply)
      + pool         = (known after apply)
      + tags         = (known after apply)
      + zone         = (known after apply)

      + allocate_params {
          + hostname      = "host_name"
          + min_cpu_count = 0
          + min_memory    = 0
          + pool          = "pool_name"
          + tags          = [
              + "terraform",
            ]
        }

      + deploy_params {
          + distro_series  = "custom/image"
          + enable_hw_sync = false
        }

      + network_interfaces {
          + ip_address  = "ip_value"
          + name        = "interface_name"
          + subnet_cidr = "subnet_value"
        }
    }

  # maas_machine.group_name["host_name"] will be created
  + resource "maas_machine" "group_name" {
      + architecture     = "amd64/generic"
      + domain           = "domain_name"
      + hostname         = "host_name"
      + id               = (known after apply)
      + min_hwe_kernel   = (known after apply)
      + pool             = "pool_name"
      + power_parameters = (sensitive value)
      + power_type       = "ipmi"
      + pxe_mac_address  = "mac_value"
      + zone             = (known after apply)
    }

  # maas_tag.terraform will be updated in-place
  ~ resource "maas_tag" "terraform" {
        id       = "terraform"
      ~ machines = [
          - "43w7ys",
          - "46mkgk",
          - "47qtqh",
          - "4btsxp",
          - "4cr7xg",
          - "4d77gc",
          - "4fe8sm",
          - "4kghms",
          - "4ntdsp",
          - "4ra8ts",
          - "4w4y3k",
          - "66rdxm",
          - "6cdrfq",
          - "6ftsgq",
          - "6mfwfk",
          - "6mtyrg",
          - "6q3rmg",
          - "6q7c4h",
          - "6qra6f",
          - "6rrgse",
          - "6wpnh8",
          - "73fgwd",
          - "76xf3y",
          - "777sbx",
          - "77p7ts",
          - "77rsbg",
          - "7drxka",
          - "7mmstm",
          - "87tpgh",
          - "8d33gt",
          - "8d7afw",
          - "8ddysf",
          - "8dgyph",
          - "8dnmmh",
          - "8r3pxk",
          - "8wmmtf",
          - "a3b8rh",
          - "a3f3a3",
          - "a77tpe",
          - "abbxnf",
          - "acefkw",
          - "agkkx3",
          - "anrp8g",
          - "apmt64",
          - "argkns",
          - "ayhdqw",
          - "b3dtgn",
          - "b8m4m3",
          - "ba3an3",
          - "bamaqk",
          - "bpcebr",
          - "brpyck",
          - "ca36hf",
          - "cakyr8",
          - "cdayyr",
          - "cf34dm",
          - "ch3e66",
          - "ckpqwy",
          - "cnhpen",
          - "cnq84g",
          - "cqpd6w",
          - "crbnax",
          - "cs3ybw",
          - "csfkxx",
          - "cw73dq",
          - "cxdm7s",
          - "d6qphb",
          - "da8qxn",
          - "dddpcr",
          - "de8enw",
          - "dghkyp",
          - "dh73mx",
          - "dm8a3b",
          - "dynrtq",
          - "dyqycb",
          - "e3xbfy",
          - "e4rxky",
          - "e6pkpk",
          - "e8qh8c",
          - "e8xy67",
          - "edheay",
          - "edpn6s",
          - "efhraf",
          - "ehf67n",
          - "epfedh",
          - "exagap",
          - "f3r4kw",
          - "f8mnnr",
          - "fb734n",
          - "fcqwdg",
          - "fd6axf",
          - "fdty7s",
          - "fhedht",
          - "fhggqa",
          - "fpqecb",
          - "fqcppm",
          - "fsrrmy",
          - "fydsyr",
          - "g3n3w8",
          - "g4cbxk",
          - "g8bnps",
          - "gfbmqm",
          - "grsphw",
          - "grwsna",
          - "gwt3bk",
          - "gwxdk4",
          - "gy7fqh",
          - "gykbtq",
          - "h3cxhw",
          - "h4hhpa",
          - "h4swnt",
          - "h8dxsk",
          - "hdptkd",
          - "hnck87",
          - "hqdb8c",
          - "k38pf4",
          - "kc7eng",
          - "kdptan",
          - "kfspqf",
          - "kmmb4b",
          - "knqxwp",
          - "kptqdq",
          - "m33py4",
          - "m36kb6",
          - "m4n3pa",
          - "m7tg7h",
          - "mdm3tq",
          - "mfpfyw",
          - "mhmdfm",
          - "mhpcch",
          - "mph8k8",
          - "mxbdr8",
          - "mypkxc",
          - "nbgbqc",
          - "ngqagk",
          - "nhmgc3",
          - "nhwdgp",
          - "nkey34",
          - "nmbb3s",
          - "np4s3y",
          - "nsgd4g",
          - "nt8rbb",
          - "ntmskf",
          - "nwecd6",
          - "p33nmf",
          - "p4nkhh",
          - "p688xx",
          - "p6q7px",
          - "p7bsc6",
          - "p7mw3t",
          - "pamc6y",
          - "pcs6m8",
          - "pdm7yb",
          - "pggafy",
          - "pgwcqt",
          - "pmbpe7",
          - "pprryg",
          - "pq7bwq",
          - "pxcrkg",
          - "q3pcbe",
          - "q8yycs",
          - "qemfcn",
          - "qerxn4",
          - "qgfmbg",
          - "qk3bxd",
          - "qkk3mr",
          - "qpfa7n",
          - "qqqrk3",
          - "r3w8wb",
          - "rktq6x",
          - "rrw3qq",
          - "rt677c",
          - "rth8hg",
          - "rwcfeg",
          - "rypq6h",
          - "s3sa77",
          - "s7cskm",
          - "s8qcb7",
          - "scdh6c",
          - "sfeg7n",
          - "shmxk7",
          - "skxyhb",
          - "smy8ah",
          - "sparkt",
          - "sqnhqy",
          - "ssexya",
          - "ssnqex",
          - "sx84qb",
          - "t38fcs",
          - "t748b4",
          - "tgfxxr",
          - "th4bfa",
          - "tnh3gh",
          - "tnrk7w",
          - "tpktbf",
          - "trcp3y",
          - "tta7kg",
          - "ttfxqr",
          - "txbak3",
          - "tywf6x",
          - "wenhcw",
          - "wpwxqa",
          - "wxcsy4",
          - "wxfdya",
          - "x4n6wa",
          - "xd4txr",
          - "xhpb4x",
          - "xka4se",
          - "xpfs4w",
          - "xpprde",
          - "xsdp7q",
          - "xshax6",
          - "xwbxnd",
          - "xy8gcb",
          - "y33cyg",
          - "yc63hp",
          - "yddsrh",
          - "ydfrxe",
          - "ydna4y",
          - "yet78w",
          - "yg8neh",
          - "ygc8a3",
          - "ygsw7g",
          - "ykhq8r",
          - "ypqmsa",
          - "yrwtt6",
          - "ywppqa",
        ] -> (known after apply)
        name     = "terraform"
        # (1 unchanged attribute hidden)
    }

Plan: 2 to add, 1 to change, 0 to destroy.

Either I have fallen into some trap with regards to how terraform resolves loops in list items or there is some logic problem here?

Note that without adding this one node, no changes will be applied and all infrastructure matches terraform definitions.

skatsaounis commented 1 month ago

Hi @fivovic and thank you for opening this issue. Sorry for the big delay in the reply. I think that your issue is valid. Let me try to explain what the provider is doing whenever you want to update a tag resource:

As you can see, the part of fetching 200 times all the MAAS machines adds that enormous delay to the operations, it is unnecessary and it should be fixed.

FYI this is the code line that is triggered for every identifier in the machines set: https://github.com/canonical/terraform-provider-maas/blob/7153222b6b91a2369bbf59d942414a1f5eed211b/maas/resource_maas_tag.go#L197