OpenNebula / terraform-provider-opennebula

Terraform provider for OpenNebula
https://www.terraform.io/docs/providers/opennebula/
Mozilla Public License 2.0

opennebula_cluster removes resources on secondary run #389

Closed. TGM closed this issue 1 year ago.

TGM commented 1 year ago

OpenNebula version: 6.4.0
terraform-provider-opennebula version: 1.1.0

On the second run, the hosts and datastores are removed from the opennebula_cluster.

Configuration:

resource "opennebula_cluster" "cluster" {
  name = var.cluster_name
}

# using custom type until https://github.com/OpenNebula/terraform-provider-opennebula/issues/385 is fixed
resource "opennebula_host" "host" {
  for_each   = toset(var.host_name)

  name       = each.key
  type       = "custom"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    virtualization = "qemu"
    information    = "qemu"
  }
}

resource "opennebula_datastore" "datastore" {
  for_each = var.datastore_name

  name = each.key
  type = each.value
  cluster_id = opennebula_cluster.cluster.id

  custom {
    datastore = "fs"
    transfer = "ssh"
  }
}

Variables:

variable "cluster_name" {
  type = string
  description = "Cluster managing a group of hosts. Naming convention: vh<team>-<env>"
  default = "vhinfra-dev"
}

variable "host_name" {
  type = list
  description = "Hosts part of this cluster"
  default = [
    "opennebula_backend_dev",
    "opennebula_backend_dev_2",
  ]
}

variable "datastore_name" {
  type = map
  description = "Datastore attached to hypervisors. This is the equivalent of /data/vm. Naming convention: vh<team>-<env>-small"
  default = {
    "vhinfra-dev-small" = "system",
    "vhinfra-dev-large" = "system",
    "vhinfra-dev-image" = "image",
  }
}

Run 1:

 terraform apply

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # opennebula_cluster.cluster will be created
  + resource "opennebula_cluster" "cluster" {
      + default_tags = (known after apply)
      + id           = (known after apply)
      + name         = "vhinfra-dev"
      + tags_all     = (known after apply)
    }

  # opennebula_datastore.datastore["vhinfra-dev-image"] will be created
  + resource "opennebula_datastore" "datastore" {
      + cluster_id             = (known after apply)
      + default_tags           = (known after apply)
      + driver                 = (known after apply)
      + id                     = (known after apply)
      + name                   = "vhinfra-dev-image"
      + restricted_directories = (known after apply)
      + safe_directories       = (known after apply)
      + tags_all               = (known after apply)
      + type                   = "image"

      + custom {
          + datastore = "fs"
          + transfer  = "ssh"
        }
    }

  # opennebula_datastore.datastore["vhinfra-dev-large"] will be created
  + resource "opennebula_datastore" "datastore" {
      + cluster_id             = (known after apply)
      + default_tags           = (known after apply)
      + driver                 = (known after apply)
      + id                     = (known after apply)
      + name                   = "vhinfra-dev-large"
      + restricted_directories = (known after apply)
      + safe_directories       = (known after apply)
      + tags_all               = (known after apply)
      + type                   = "system"

      + custom {
          + datastore = "fs"
          + transfer  = "ssh"
        }
    }

  # opennebula_datastore.datastore["vhinfra-dev-small"] will be created
  + resource "opennebula_datastore" "datastore" {
      + cluster_id             = (known after apply)
      + default_tags           = (known after apply)
      + driver                 = (known after apply)
      + id                     = (known after apply)
      + name                   = "vhinfra-dev-small"
      + restricted_directories = (known after apply)
      + safe_directories       = (known after apply)
      + tags_all               = (known after apply)
      + type                   = "system"

      + custom {
          + datastore = "fs"
          + transfer  = "ssh"
        }
    }

  # opennebula_host.host["opennebula_backend_dev"] will be created
  + resource "opennebula_host" "host" {
      + cluster_id   = (known after apply)
      + default_tags = (known after apply)
      + id           = (known after apply)
      + name         = "opennebula_backend_dev"
      + tags_all     = (known after apply)
      + type         = "custom"

      + custom {
          + information    = "qemu"
          + virtualization = "qemu"
        }
    }

  # opennebula_host.host["opennebula_backend_dev_2"] will be created
  + resource "opennebula_host" "host" {
      + cluster_id   = (known after apply)
      + default_tags = (known after apply)
      + id           = (known after apply)
      + name         = "opennebula_backend_dev_2"
      + tags_all     = (known after apply)
      + type         = "custom"

      + custom {
          + information    = "qemu"
          + virtualization = "qemu"
        }
    }

Plan: 6 to add, 0 to change, 0 to destroy.

Run 2:

terraform apply
opennebula_cluster.cluster: Refreshing state... [id=123]
opennebula_datastore.datastore["vhinfra-dev-small"]: Refreshing state... [id=203]
opennebula_host.host["opennebula_backend_dev"]: Refreshing state... [id=65]
opennebula_datastore.datastore["vhinfra-dev-large"]: Refreshing state... [id=202]
opennebula_datastore.datastore["vhinfra-dev-image"]: Refreshing state... [id=201]
opennebula_host.host["opennebula_backend_dev_2"]: Refreshing state... [id=64]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # opennebula_cluster.cluster will be updated in-place
  ~ resource "opennebula_cluster" "cluster" {
      ~ datastores       = [
          - 201,
          - 202,
          - 203,
        ]
      ~ hosts            = [
          - 64,
          - 65,
        ]
        id               = "123"
        name             = "vhinfra-dev"
        # (3 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Expected behavior: No change is expected.

treywelsh commented 1 year ago

Thanks for the issue, good point. I forgot to check all the inter-resource relationships.

There are actually two ways to tie a datastore/host to a cluster, and the way it is currently implemented is problematic.

When you attach a host to a cluster via the cluster_id field like you did above, the hosts section of the cluster stays empty in the TF file, but when the provider reads the cluster data on the cloud side it sees that some hosts are attached. It then detects a diff on the cluster resource and believes it has to remove the hosts and datastores from the cluster.

treywelsh commented 1 year ago

We need to remove the direct dependencies between the two resources.

Adding Computed behavior (in addition to Optional) on the hosts section of the cluster solves the case above, but it doesn't solve the following case, where there is still a diff after the initial apply:

resource "opennebula_cluster" "cluster" {
  name = "cluster1"

  hosts = [
    opennebula_host.host1.id,
  ]
}

resource "opennebula_host" "host1" {
  name       = "host1"
  type       = "custom"

  custom {
    virtualization = "dummy"
    information    = "dummy"
  }
}

resource "opennebula_host" "host2" {
  name       = "host2"
  type       = "custom"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    virtualization = "dummy"
    information    = "dummy"
  }
}

This might be solved by reading only the configured elements of the hosts section of the cluster resource, but with that solution we would no longer be able to import the hosts and datastores of the resource. In addition, it creates problems when updating the hosts field of the cluster resource.

Some other ideas:

Another thing that is inconsistent, and may be confusing across the provider's resources, is the way we delete a group vs. the way we delete a cluster:

I'll probably open one or two issues (on groups and users) after that to get consistent membership-management behavior across resources.

TGM commented 1 year ago

The proposed solution does not allow you to associate existing resources with the cluster.

treywelsh commented 1 year ago

You still have the cluster_id field in the datastore and host resources to manage membership. If we keep two distinct places to manage cluster membership, we'll still get diffs when both are used at the same time, as I said above with:

resource "opennebula_cluster" "cluster" {
  name = "cluster1"

  hosts = [
    opennebula_host.host1.id,
  ]
}

resource "opennebula_host" "host1" {
  name       = "host1"
  type       = "custom"

  custom {
    virtualization = "dummy"
    information    = "dummy"
  }
}

resource "opennebula_host" "host2" {
  name       = "host2"
  type       = "custom"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    virtualization = "dummy"
    information    = "dummy"
  }
}

To retrieve the list of cluster members, I propose to change the cluster fields hosts and datastores from Optional to Computed. This would let you see, from the cluster resource, who is a member of the cluster, without being able to modify membership from that place, as sketched below.
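
For illustration, a minimal sketch of how that could look from the configuration side (output names are illustrative): membership would be set only via cluster_id on the member resources, while the computed lists could still be read from the cluster resource.

resource "opennebula_cluster" "cluster" {
  name = "cluster1"
  # hosts and datastores would be computed only: readable here, not settable.
}

output "cluster_host_ids" {
  value = opennebula_cluster.cluster.hosts
}

output "cluster_datastore_ids" {
  value = opennebula_cluster.cluster.datastores
}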

treywelsh commented 1 year ago

We could instead manage membership from the cluster resource, but I did it this way because:

I had to make a choice and I knew these arguments, but it's not a big deal to manage membership from the cluster resource, so if everyone thinks it's better, I'm open to doing it that way. If you have another idea, feel free to drop another comment; I'm open to discussing it and rewriting this PR.

TGM commented 1 year ago

The proposed solution is good, but we also need a way to associate existing resources. For example, associating the default DS with a new cluster.

treywelsh commented 1 year ago

OK, so membership management would be less exclusive to the cluster. PR updated.

TGM commented 1 year ago

Seems legit. Can you publish it as an RC release or something so we can test it in pre-prod?

frousselet commented 1 year ago

The 1.1.1-rc1 will be released by the end of the week.

TGM commented 1 year ago

Partially fixed: new datastores are allocated to the default cluster as well.

resource "opennebula_cluster" "cluster" {
  name       = "${var.cluster_name}-${var.cluster_env}-${var.cluster_cpu_vendor}"
  hosts      = [for host in opennebula_host.host: host.id]
  datastores = [for datastore in opennebula_datastore.datastore: datastore.id]
  # virtual_networks = [145]
}

resource "opennebula_datastore" "datastore" {
  count = var.cluster_datastores

  name       = "${var.cluster_name}-${var.cluster_env}-${var.cluster_cpu_vendor}-ds${count.index + 1}"
  type       = "system"

  custom {
    transfer  = "ssh"
  }
}

resource "opennebula_host" "host" {
  for_each = toset(var.cluster_hosts)

  name       = each.key
  type       = "kvm"
}

treywelsh commented 1 year ago

With the current way of managing cluster membership, which we just modified for the last RC release of the provider:

A new datastore is added to the default cluster due to these points:

To fix this we could:

TGM commented 1 year ago

A few ideas to consider:

treywelsh commented 1 year ago

Your first idea would mean using cluster_id in the datastore to specify a cluster other than the default. That would take us back to the initial problem of this issue and roll back our last changes.

We could go that way, i.e. manage the dependency from the datastore/host side via the cluster_id field (and then deprecate the hosts and datastores sections of the cluster resource). We would then need to consider your earlier comment, from when I first proposed this solution, and see how to address it another way. The answer could be: you need to import the default resources, and then manage their membership via the cluster_id attribute.
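
A minimal sketch of that import-based approach, assuming a stock deployment where the default system datastore has ID 0 (the ID, name and custom values are illustrative and must match the imported datastore):

# terraform import opennebula_datastore.default_system 0
resource "opennebula_datastore" "default_system" {
  name       = "system"
  type       = "system"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    transfer = "ssh"
  }
}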

It's still possible to roll back these changes, as the provider hasn't been released with the related attribute changes yet...

TGM commented 1 year ago

Keep the current RC1 changes; they are way better than the previous version. We could go this way to solve the problem without any further changes:

  1. Run a data query for the default cluster.
  2. Update the default cluster's resources to null.

This might be the best idea yet. That way we don't lose the flexibility we've gained, and resources are still managed individually.

Update: I just realized that updating the default cluster won't work without importing the resource...

TGM commented 1 year ago

Interesting behavior: when hosts are created, they are correctly associated only with the new cluster. I wonder why...

TGM commented 1 year ago

Related?

opennebula_cluster.cluster: Destroying... [id=174]
╷
│ Error: Failed to delete
│ 
│ cluster (ID: 174): OpenNebula error [ACTION]: [one.cluster.delete] Cannot delete cluster. Cluster 174 is not empty, it contains 2 hosts.
╵
treywelsh commented 1 year ago

Related?

opennebula_cluster.cluster: Destroying... [id=174]
╷
│ Error: Failed to delete
│ 
│ cluster (ID: 174): OpenNebula error [ACTION]: [one.cluster.delete] Cannot delete cluster. Cluster 174 is not empty, it contains 2 hosts.
╵

You can't delete a cluster with hosts/datastores inside it; that's an OpenNebula constraint, and the provider doesn't add behavior to empty the cluster of its members before deleting it.

treywelsh commented 1 year ago

Interesting behavior: when hosts are created, they are correctly associated only with the new cluster. I wonder why...

As I said in a previous comment, a host is only in one cluster at a time (a datastore can be in several clusters at the same time), so when you add it to a new cluster, OpenNebula automatically removes it from the other. This is not related to the provider.

TGM commented 1 year ago

Related?

opennebula_cluster.cluster: Destroying... [id=174]
╷
│ Error: Failed to delete
│ 
│ cluster (ID: 174): OpenNebula error [ACTION]: [one.cluster.delete] Cannot delete cluster. Cluster 174 is not empty, it contains 2 hosts.
╵

You can't delete a cluster with hosts/datastores inside it; that's an OpenNebula constraint, and the provider doesn't add behavior to empty the cluster of its members before deleting it.

In that case, you can't use terraform destroy.

It seems to ignore depends_on as well. Any ideas?

treywelsh commented 1 year ago

Unfortunately, it seems that Terraform needs a bit of help with the dependencies here: it tries to delete the cluster first, but that doesn't follow OpenNebula's rules. Terraform is already aware of the dependency, so adding depends_on will probably just create a dependency cycle.

With the first idea of managing dependencies from the host/datastore side, I think Terraform would try to delete the hosts and datastores first.

So, for now, you need to do it in two steps: detach the hosts and datastores in a first apply, then destroy the cluster in a second apply, as sketched below.
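
For illustration, one way the two steps could look with the configuration above (a sketch only; it assumes that emptying the membership lists detaches the members, which then fall back to the default cluster):

# Step 1 (first apply): empty the membership lists so the cluster has no members.
resource "opennebula_cluster" "cluster" {
  name       = "${var.cluster_name}-${var.cluster_env}-${var.cluster_cpu_vendor}"
  hosts      = []
  datastores = []
}

# Step 2 (second run): destroy the now-empty cluster, e.g. with terraform destroy.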

Otherwise, it comes down to how we want to design the provider: we could add some optional behavior to the cluster resource that removes the dependencies before deleting the cluster, in a single step.
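
Purely as a hypothetical illustration of that optional behavior (no such argument exists in the provider today; the name is invented here):

resource "opennebula_cluster" "cluster" {
  name = "cluster1"

  # Hypothetical, not an existing provider argument: when enabled, the provider
  # would detach any remaining hosts/datastores before deleting the cluster.
  # force_delete = true
}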

TGM commented 1 year ago

  1. Keep the cluster_id attributes in opennebula_host, opennebula_datastore and opennebula_virtual_network.
  2. Deprecate opennebula_cluster / datastores.
  3. Deprecate opennebula_cluster / hosts.
  4. Deprecate opennebula_cluster / virtual_network.
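
For illustration, a minimal sketch of a configuration under that proposal, with membership managed only from the member side (names and custom values are illustrative; the virtual network case is left out):

resource "opennebula_cluster" "cluster" {
  name = "cluster1"
}

resource "opennebula_host" "host1" {
  name       = "host1"
  type       = "custom"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    virtualization = "dummy"
    information    = "dummy"
  }
}

resource "opennebula_datastore" "ds1" {
  name       = "ds1"
  type       = "system"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    transfer = "ssh"
  }
}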