OpenNebula / terraform-provider-opennebula

Terraform provider for OpenNebula
https://www.terraform.io/docs/providers/opennebula/
Mozilla Public License 2.0

opennebula_cluster removes resources on secondary run #389

Closed. TGM closed this issue 1 year ago.

TGM commented 1 year ago

OpenNebula version: 6.4.0
terraform-provider-opennebula version: 1.1.0

On the second run, the hosts and datastores are removed from the opennebula_cluster.

Configuration:

resource "opennebula_cluster" "cluster" {
  name = var.cluster_name
}

# using custom type until https://github.com/OpenNebula/terraform-provider-opennebula/issues/385 is fixed
resource "opennebula_host" "host" {
  for_each   = toset(var.host_name)

  name       = each.key
  type       = "custom"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    virtualization = "qemu"
    information    = "qemu"
  }
}

resource "opennebula_datastore" "datastore" {
  for_each = var.datastore_name

  name = each.key
  type = each.value
  cluster_id = opennebula_cluster.cluster.id

  custom {
    datastore = "fs"
    transfer = "ssh"
  }
}

Variables:

variable "cluster_name" {
  type = string
  description = "Cluster managing a group of hosts. Naming convention: vh<team>-<env>"
  default = "vhinfra-dev"
}

variable "host_name" {
  type = list
  description = "Hosts part of this cluster"
  default = [
    "opennebula_backend_dev",
    "opennebula_backend_dev_2",
  ]
}

variable "datastore_name" {
  type = map
  description = "Datastore attached to hypervisors. This is the equivalent of /data/vm. Naming convention: vh<team>-<env>-small"
  default = {
    "vhinfra-dev-small" = "system",
    "vhinfra-dev-large" = "system",
    "vhinfra-dev-image" = "image",
  }
}

Run 1:

 terraform apply

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # opennebula_cluster.cluster will be created
  + resource "opennebula_cluster" "cluster" {
      + default_tags = (known after apply)
      + id           = (known after apply)
      + name         = "vhinfra-dev"
      + tags_all     = (known after apply)
    }

  # opennebula_datastore.datastore["vhinfra-dev-image"] will be created
  + resource "opennebula_datastore" "datastore" {
      + cluster_id             = (known after apply)
      + default_tags           = (known after apply)
      + driver                 = (known after apply)
      + id                     = (known after apply)
      + name                   = "vhinfra-dev-image"
      + restricted_directories = (known after apply)
      + safe_directories       = (known after apply)
      + tags_all               = (known after apply)
      + type                   = "image"

      + custom {
          + datastore = "fs"
          + transfer  = "ssh"
        }
    }

  # opennebula_datastore.datastore["vhinfra-dev-large"] will be created
  + resource "opennebula_datastore" "datastore" {
      + cluster_id             = (known after apply)
      + default_tags           = (known after apply)
      + driver                 = (known after apply)
      + id                     = (known after apply)
      + name                   = "vhinfra-dev-large"
      + restricted_directories = (known after apply)
      + safe_directories       = (known after apply)
      + tags_all               = (known after apply)
      + type                   = "system"

      + custom {
          + datastore = "fs"
          + transfer  = "ssh"
        }
    }

  # opennebula_datastore.datastore["vhinfra-dev-small"] will be created
  + resource "opennebula_datastore" "datastore" {
      + cluster_id             = (known after apply)
      + default_tags           = (known after apply)
      + driver                 = (known after apply)
      + id                     = (known after apply)
      + name                   = "vhinfra-dev-small"
      + restricted_directories = (known after apply)
      + safe_directories       = (known after apply)
      + tags_all               = (known after apply)
      + type                   = "system"

      + custom {
          + datastore = "fs"
          + transfer  = "ssh"
        }
    }

  # opennebula_host.host["opennebula_backend_dev"] will be created
  + resource "opennebula_host" "host" {
      + cluster_id   = (known after apply)
      + default_tags = (known after apply)
      + id           = (known after apply)
      + name         = "opennebula_backend_dev"
      + tags_all     = (known after apply)
      + type         = "custom"

      + custom {
          + information    = "qemu"
          + virtualization = "qemu"
        }
    }

  # opennebula_host.host["opennebula_backend_dev_2"] will be created
  + resource "opennebula_host" "host" {
      + cluster_id   = (known after apply)
      + default_tags = (known after apply)
      + id           = (known after apply)
      + name         = "opennebula_backend_dev_2"
      + tags_all     = (known after apply)
      + type         = "custom"

      + custom {
          + information    = "qemu"
          + virtualization = "qemu"
        }
    }

Plan: 6 to add, 0 to change, 0 to destroy.

Run 2:

terraform apply
opennebula_cluster.cluster: Refreshing state... [id=123]
opennebula_datastore.datastore["vhinfra-dev-small"]: Refreshing state... [id=203]
opennebula_host.host["opennebula_backend_dev"]: Refreshing state... [id=65]
opennebula_datastore.datastore["vhinfra-dev-large"]: Refreshing state... [id=202]
opennebula_datastore.datastore["vhinfra-dev-image"]: Refreshing state... [id=201]
opennebula_host.host["opennebula_backend_dev_2"]: Refreshing state... [id=64]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # opennebula_cluster.cluster will be updated in-place
  ~ resource "opennebula_cluster" "cluster" {
      ~ datastores       = [
          - 201,
          - 202,
          - 203,
        ]
      ~ hosts            = [
          - 64,
          - 65,
        ]
        id               = "123"
        name             = "vhinfra-dev"
        # (3 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Expected behavior: No change is expected.

treywelsh commented 1 year ago

Thanks for the issue, good point. I forgot to check all the inter-resource relationships.

There are actually two ways to tie a datastore/host to a cluster, and the way it is currently implemented is problematic.

When you attach a host to a cluster via the cluster_id field like you did above, the hosts section of the cluster stays empty in the TF file, but when the provider reads the cluster data on the cloud side it sees that some hosts are attached. It then detects a diff on the cluster resource and believes it has to remove the hosts and datastores from the cluster.

treywelsh commented 1 year ago

We need to remove the direct dependencies between the two resources.

Adding Computed behavior (in addition to Optional) on the hosts section of the cluster solves the case above, but it doesn't solve the following case, where there is still a diff after the initial apply:

resource "opennebula_cluster" "cluster" {
  name = "cluster1"

  hosts = [
    opennebula_host.host1.id,
  ]
}

resource "opennebula_host" "host1" {
  name       = "host1"
  type       = "custom"

  custom {
    virtualization = "dummy"
    information    = "dummy"
  }
}

resource "opennebula_host" "host2" {
  name       = "host2"
  type       = "custom"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    virtualization = "dummy"
    information    = "dummy"
  }
}

This might be solved by reading only the configured elements of the hosts section of the cluster resource, but with that solution we would no longer be able to import the hosts and datastores of the resource. In addition, it creates problems when updating the hosts field of the cluster resource.

Some other ideas:

Another thing that is inconsistent, and may be confusing across the provider's resources, is the way we delete a group vs. the way we delete a cluster:

I'll probably open one or two issues (on groups and users) after that to get consistent membership-management behavior across resources.

TGM commented 1 year ago

The proposed solution does not allow you to associate existing resources with the cluster.

treywelsh commented 1 year ago

You still have the cluster_id field in the datastore and host resources to manage membership. If we keep two distinct places to manage cluster membership, we'll still get diffs when both are used at the same time, as I said above with:

resource "opennebula_cluster" "cluster" {
  name = "cluster1"

  hosts = [
    opennebula_host.host1.id,
  ]
}

resource "opennebula_host" "host1" {
  name       = "host1"
  type       = "custom"

  custom {
    virtualization = "dummy"
    information    = "dummy"
  }
}

resource "opennebula_host" "host2" {
  name       = "host2"
  type       = "custom"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    virtualization = "dummy"
    information    = "dummy"
  }
}

To retrieve the list of cluster members, I propose to change the cluster fields hosts and datastores from Optional to Computed. This would let you see, from the cluster resource, who is a member of the cluster, without being able to modify membership from that place, as sketched below.
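
For illustration, a minimal sketch of how that could look from the configuration side (output names are illustrative): membership would be set only via cluster_id on the member resources, while the computed lists could still be read from the cluster resource.

resource "opennebula_cluster" "cluster" {
  name = "cluster1"
  # hosts and datastores would be computed only: readable here, not settable.
}

output "cluster_host_ids" {
  value = opennebula_cluster.cluster.hosts
}

output "cluster_datastore_ids" {
  value = opennebula_cluster.cluster.datastores
}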

treywelsh commented 1 year ago

We could instead manage membership from the cluster resource, but I did it this way because:

I had to make a choice and I knew these arguments, but it's not a big deal to manage membership from the cluster resource, so if everyone thinks it's better, I'm open to doing it that way. If you have another idea, feel free to drop another comment; I'm open to discussing it and rewriting this PR.

TGM commented 1 year ago

The proposed solution is good, but we also need a way to associate existing resources. For example, associating the default DS with a new cluster.

treywelsh commented 1 year ago

OK, so membership management would be less exclusive to the cluster. PR updated.

TGM commented 1 year ago

Seems legit. Can you publish it as an RC release or something so we can test it in pre-prod?

frousselet commented 1 year ago

The 1.1.1-rc1 will be released by the end of the week.

TGM commented 1 year ago

Partially fixed: new datastores are allocated to the default cluster as well.

resource "opennebula_cluster" "cluster" {
  name       = "${var.cluster_name}-${var.cluster_env}-${var.cluster_cpu_vendor}"
  hosts      = [for host in opennebula_host.host: host.id]
  datastores = [for datastore in opennebula_datastore.datastore: datastore.id]
  # virtual_networks = [145]
}

resource "opennebula_datastore" "datastore" {
  count = var.cluster_datastores

  name       = "${var.cluster_name}-${var.cluster_env}-${var.cluster_cpu_vendor}-ds${count.index + 1}"
  type       = "system"

  custom {
    transfer  = "ssh"
  }
}

resource "opennebula_host" "host" {
  for_each = toset(var.cluster_hosts)

  name       = each.key
  type       = "kvm"
}

treywelsh commented 1 year ago

With the current way of managing cluster membership, which we just modified for the last RC release of the provider:

A new datastore is added to the default cluster due to these points:

To fix this we could:

TGM commented 1 year ago

A few ideas to consider:

treywelsh commented 1 year ago

Your first idea would mean using cluster_id in the datastore to specify a cluster other than the default. That would take us back to the initial problem of this issue and roll back our last changes.

We could go that way, i.e. manage the dependency from the datastore/host side via the cluster_id field (and then deprecate the hosts and datastores sections of the cluster resource). We would then need to consider your earlier comment, from when I first proposed this solution, and see how to address it another way. The answer could be: you need to import the default resources, and then manage their membership via the cluster_id attribute.
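
A minimal sketch of that import-based approach, assuming a stock deployment where the default system datastore has ID 0 (the ID, name and custom values are illustrative and must match the imported datastore):

# terraform import opennebula_datastore.default_system 0
resource "opennebula_datastore" "default_system" {
  name       = "system"
  type       = "system"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    transfer = "ssh"
  }
}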

It's still possible to roll back these changes, as the provider hasn't been released with the related attribute changes yet...

TGM commented 1 year ago

Keep the current RC1 changes; they are way better than the previous version. We could go this way to solve the problem without any further changes:

  1. Run a data query for the default cluster.
  2. Update the default cluster's resources to null.

This might be the best idea yet. That way we don't lose the flexibility we've gained, and resources are still managed individually.

Update: I just realized that updating the default cluster won't work without importing the resource...

TGM commented 1 year ago

Interesting behavior: when hosts are created, they are correctly associated only with the new cluster. I wonder why...

TGM commented 1 year ago

Related?

opennebula_cluster.cluster: Destroying... [id=174]
╷
│ Error: Failed to delete
│ 
│ cluster (ID: 174): OpenNebula error [ACTION]: [one.cluster.delete] Cannot delete cluster. Cluster 174 is not empty, it contains 2 hosts.
╵
treywelsh commented 1 year ago

Related?

opennebula_cluster.cluster: Destroying... [id=174]
╷
│ Error: Failed to delete
│ 
│ cluster (ID: 174): OpenNebula error [ACTION]: [one.cluster.delete] Cannot delete cluster. Cluster 174 is not empty, it contains 2 hosts.
╵

You can't delete a cluster with hosts/datastores inside it; that's an OpenNebula constraint, and the provider doesn't add behavior to empty the cluster of its members before deleting it.

treywelsh commented 1 year ago

Interesting behavior: when hosts are created, they are correctly associated only with the new cluster. I wonder why...

As I said in a previous comment, a host is only in one cluster at a time (a datastore can be in several clusters at the same time), so when you add it to a new cluster, OpenNebula automatically removes it from the other. This is not related to the provider.

TGM commented 1 year ago

Related?

opennebula_cluster.cluster: Destroying... [id=174]
╷
│ Error: Failed to delete
│ 
│ cluster (ID: 174): OpenNebula error [ACTION]: [one.cluster.delete] Cannot delete cluster. Cluster 174 is not empty, it contains 2 hosts.
╵

You can't delete a cluster with hosts/datastores inside it; that's an OpenNebula constraint, and the provider doesn't add behavior to empty the cluster of its members before deleting it.

In that case, you can't use terraform destroy.

It seems to ignore depends_on as well. Any ideas?

treywelsh commented 1 year ago

Unfortunately, it seems that Terraform needs a bit of help with the dependencies here: it tries to delete the cluster first, but that doesn't follow OpenNebula's rules. Terraform is already aware of the dependency, so adding depends_on will probably just create a dependency cycle.

With the first idea of managing dependencies from the host/datastore side, I think Terraform would try to delete the hosts and datastores first.

So, for now, you need to do it in two steps: detach the hosts and datastores in a first apply, then destroy the cluster in a second apply, as sketched below.
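
For illustration, one way the two steps could look with the configuration above (a sketch only; it assumes that emptying the membership lists detaches the members, which then fall back to the default cluster):

# Step 1 (first apply): empty the membership lists so the cluster has no members.
resource "opennebula_cluster" "cluster" {
  name       = "${var.cluster_name}-${var.cluster_env}-${var.cluster_cpu_vendor}"
  hosts      = []
  datastores = []
}

# Step 2 (second run): destroy the now-empty cluster, e.g. with terraform destroy.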

Otherwise, it comes down to how we want to design the provider: we could add some optional behavior to the cluster resource that removes the dependencies before deleting the cluster, in a single step.
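
Purely as a hypothetical illustration of that optional behavior (no such argument exists in the provider today; the name is invented here):

resource "opennebula_cluster" "cluster" {
  name = "cluster1"

  # Hypothetical, not an existing provider argument: when enabled, the provider
  # would detach any remaining hosts/datastores before deleting the cluster.
  # force_delete = true
}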

TGM commented 1 year ago

  1. Keep the cluster_id attributes in opennebula_host, opennebula_datastore and opennebula_virtual_network.
  2. Deprecate opennebula_cluster / datastores.
  3. Deprecate opennebula_cluster / hosts.
  4. Deprecate opennebula_cluster / virtual_network.
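
For illustration, a minimal sketch of a configuration under that proposal, with membership managed only from the member side (names and custom values are illustrative; the virtual network case is left out):

resource "opennebula_cluster" "cluster" {
  name = "cluster1"
}

resource "opennebula_host" "host1" {
  name       = "host1"
  type       = "custom"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    virtualization = "dummy"
    information    = "dummy"
  }
}

resource "opennebula_datastore" "ds1" {
  name       = "ds1"
  type       = "system"
  cluster_id = opennebula_cluster.cluster.id

  custom {
    transfer = "ssh"
  }
}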