OpenNebula / terraform-provider-opennebula

Terraform provider for OpenNebula
https://www.terraform.io/docs/providers/opennebula/
Mozilla Public License 2.0

Disk resources not recorded in the state file #381

Closed: sorinpad closed this issue 1 year ago

sorinpad commented 1 year ago

Terraform Version

Terraform v1.3.6
on linux_amd64
+ provider registry.terraform.io/opennebula/opennebula v1.1.0

Affected Resource(s)

* `opennebula_virtual_machine`

Terraform Configuration Files

terraform {
  required_providers {
    opennebula = {
      source = "OpenNebula/opennebula"
      version = "1.1.0"
    }
  }
}

resource "opennebula_virtual_machine" "vm" {
  name        = "testvm"
  description = "test"
  cpu         = 1
  vcpu        = 1
  memory      = 768
  group       = "oneadmin"
  permissions = "660"

  disk {
    image_id = 17
    size     = 4096
    target   = "vda"
    driver   = "qcow2"
  }

  disk {
    image_id = 25
    size     = 3072
    target   = "vdb"
    driver   = "qcow2"
  }

  on_disk_change = "SWAP"
}

Debug Output

Initial run: https://gist.github.com/bartisan/39c5de53228e64f4b734a7e059db2da8#file-opennebula-terraform-provider-run-1
Subsequent runs: https://gist.github.com/bartisan/39c5de53228e64f4b734a7e059db2da8#file-opennebula-terraform-provider-run-2

Panic Output

N/A

Expected Behavior

Disks vda and vdb get attached to the VM and subsequent terraform runs report nothing to apply.

Actual Behavior

Terraform keeps trying to attach disk vdb (and any additional disks declared in further disk blocks) on every run, and the apply fails because OpenNebula reports the target as already attached.

Steps to Reproduce

1. A first `terraform apply` shows this output:
    
    # opennebula_virtual_machine.vm will be created
    + resource "opennebula_virtual_machine" "vm" {
    ...
      + disk {
          + computed_cache           = (known after apply)
          + computed_dev_prefix      = (known after apply)
          + computed_discard         = (known after apply)
          + computed_driver          = (known after apply)
          + computed_io              = (known after apply)
          + computed_size            = (known after apply)
          + computed_target          = (known after apply)
          + computed_volatile_format = (known after apply)
          + disk_id                  = (known after apply)
          + driver                   = "qcow2"
          + image_id                 = 17
          + size                     = 4096
          + target                   = "vda"
        }
      + disk {
          + computed_cache           = (known after apply)
          + computed_dev_prefix      = (known after apply)
          + computed_discard         = (known after apply)
          + computed_driver          = (known after apply)
          + computed_io              = (known after apply)
          + computed_size            = (known after apply)
          + computed_target          = (known after apply)
          + computed_volatile_format = (known after apply)
          + disk_id                  = (known after apply)
          + driver                   = "qcow2"
          + image_id                 = 25
          + size                     = 3072
          + target                   = "vdb"
        }
    ...
      }
    }
    Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

Enter a value: yes

opennebula_virtual_machine.vm: Creating...
opennebula_virtual_machine.vm: Still creating... [10s elapsed]
opennebula_virtual_machine.vm: Creation complete after 19s [id=149]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

2. A second `terraform apply` shows the secondary disks as needing to be created. Creation fails because OpenNebula reports the disk as already attached.

opennebula_virtual_machine.vm: Refreshing state... [id=149]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # opennebula_virtual_machine.vm will be updated in-place
  ~ resource "opennebula_virtual_machine" "vm" {
        id             = "149"
        name           = "testvm"
        # (22 unchanged attributes hidden)

  + disk {
      + driver   = "qcow2"
      + image_id = 25
      + size     = 3072
      + target   = "vdb"
    }

    # (1 unchanged block hidden)
}

Plan: 0 to add, 1 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

Enter a value: yes

opennebula_virtual_machine.vm: Modifying... [id=149]
╷
│ Error: Failed to update disk
│
│   with opennebula_virtual_machine.vm,
│   on provider.tf line 17, in resource "opennebula_virtual_machine" "vm":
│   17: resource "opennebula_virtual_machine" "vm" {
│
│ virtual machine (ID: 149): vm disk attach: can't attach image to virtual machine (ID:149): OpenNebula error [ACTION]: [one.vm.attach] Target vdb is already in use.



Important Factoids

* After step 1 a single disk shows up in the state file; after step 2 the secondary disk shows up, but it is missing all `computed_*` attributes except for `computed_size`, which is 0.
* Removing `target` from the disk block lets the apply go through, but every new terraform run then keeps adding one extra disk to the VM (see the sketch after this list).
* Tested with all previous provider releases back to `0.5.2` with the same result.
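
As an illustration of the second point above, a minimal sketch of a disk block with `target` removed (image ID, size and driver taken from the configuration above); this is only meant to show the shape of the block that triggers the extra-disk behaviour:

disk {
  image_id = 25
  size     = 3072
  driver   = "qcow2"
}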

References

- #264 
treywelsh commented 1 year ago

From your log file, the interesting error is here, in the first step: https://gist.github.com/bartisan/39c5de53228e64f4b734a7e059db2da8#file-opennebula-terraform-provider-run-1-L824

It means that the provider is not able to match the vdb disk on the cloud side with its description in the TF file. However, this doesn't provide enough detail on the exact error, so I need to reproduce the problem to investigate further.

Unfortunately I couldn't reproduce the problem; here is my setup:

Terraform v1.3.6
Provider release 1.1.0
OpenNebula 6.4.0 (deployed via minione)

Here is my test file (I had to add image definitions to be able to test in my dev environment):

resource "opennebula_image" "test" {
  name         = "test"
  datastore_id = 1
  type         = "DATABLOCK"
  size         = "4096"
  dev_prefix   = "vd"
  driver       = "raw"
  permissions  = "660"

  tags = {
    billable = "true"
  }
}
resource "opennebula_image" "test2" {
  name         = "test2"
  datastore_id = 1
  type         = "DATABLOCK"
  size         = "3072"
  dev_prefix   = "vd"
  driver       = "raw"
  permissions  = "660"

  tags = {
    billable = "true"
  }
}

resource "opennebula_virtual_machine" "vm" {
  name        = "testvm"
  description = "test"
  cpu         = 1
  vcpu        = 1
  memory      = 768
  group       = "oneadmin"
  permissions = "660"

  disk {
    image_id = opennebula_image.test.id
    size     = 4096
    target   = "vda"
    driver   = "qcow2"
  }

  disk {
    image_id = opennebula_image.test2.id
    size     = 3072
    target   = "vdb"
    driver   = "qcow2"
  }

  on_disk_change = "SWAP"
}

I may be missing something; can you reproduce the problem with my test file?

sorinpad commented 1 year ago

Hey, @treywelsh,

Yes, I could reproduce the problem using your test file; I only used a different size for the images, as I don't have that much space on the default datastore.

I didn't mention it initially, but I'm also running OpenNebula 6.4.0 (deployed via minione).

Initial run:

  # opennebula_virtual_machine.vm will be created
  + resource "opennebula_virtual_machine" "vm" {
      + cpu            = 1                           
      + default_tags   = (known after apply)
      + description    = "test"
      + gid            = (known after apply)
      + gname          = (known after apply)
      + group          = "oneadmin"
      + hard_shutdown  = false    
      + id             = (known after apply)          
      + ip             = (known after apply)           
      + lcmstate       = (known after apply)              
      + memory         = 768                               
      + name           = "testvm"         
      + on_disk_change = "SWAP"                               
      + pending        = false                                     
      + permissions    = "660"
      + state          = (known after apply)               
      + tags_all       = (known after apply)      
      + template_disk  = (known after apply)       
      + template_id    = -1                       
      + template_nic   = (known after apply)               
      + template_tags  = (known after apply)
      + timeout        = 20                                                                                                                                                                                                                                                      
      + uid            = (known after apply)
      + uname          = (known after apply)
      + vcpu           = 1                   

      + disk {                                            
          + computed_cache           = (known after apply)
          + computed_dev_prefix      = (known after apply)
          + computed_discard         = (known after apply)
          + computed_driver          = (known after apply)
          + computed_io              = (known after apply)
          + computed_size            = (known after apply)
          + computed_target          = (known after apply)
          + computed_volatile_format = (known after apply)
          + disk_id                  = (known after apply)
          + driver                   = "qcow2"
          + image_id                 = (known after apply)
          + size                     = 4096
          + target                   = "vda"
        }                
      + disk {             
          + computed_cache           = (known after apply)
          + computed_dev_prefix      = (known after apply)
          + computed_discard         = (known after apply)
          + computed_driver          = (known after apply)
          + computed_io              = (known after apply)
          + computed_size            = (known after apply)
          + computed_target          = (known after apply)
          + computed_volatile_format = (known after apply)
          + disk_id                  = (known after apply)
          + driver                   = "qcow2"
          + image_id                 = (known after apply)
          + size                     = 3072
          + target                   = "vdb"
        }

      + vmgroup {
          + role       = (known after apply)
          + vmgroup_id = (known after apply)
        }
    }

Plan: 3 to add, 0 to change, 0 to destroy.

Subsequent run:

  # opennebula_virtual_machine.vm will be updated in-place
  ~ resource "opennebula_virtual_machine" "vm" {
        id             = "158"
        name           = "testvm"
        # (22 unchanged attributes hidden)

      + disk {
          + driver   = "qcow2"
          + image_id = 28
          + size     = 4096
          + target   = "vda"
        }
      + disk {
          + driver   = "qcow2"
          + image_id = 29
          + size     = 3072
          + target   = "vdb"
        }
    }

Plan: 0 to add, 1 to change, 0 to destroy.

TGM commented 1 year ago

Reproducible with Terraform 1.3.6 and the OpenNebula provider 1.0.2+.

MrFreezeex commented 1 year ago

Hi @treywelsh, thanks for the hint about the `[WARN] Configuration for disk ID` log line; it helped a lot, and we found the issue thanks to that.

So the issue was that the disk was ignored because of the driver difference between the image and the VM disk (qcow2 vs raw). I am not entirely sure why you don't hit this with the Terraform code you gave us, but in our cluster the disk driver becomes raw instead of qcow2 (because the image is also raw), and the disk is then ignored.

So for our use case we do have a workable workaround, which is to set the driver correctly (see the sketch below). Still, I think the provider should probably show that it would remove an unmatched disk instead of silently ignoring it on update; that would give the operator a hint about the problem. I am happy to give fixing this a try, but unfortunately I can't promise when :(.
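
For reference, a minimal sketch of that workaround, assuming (as in our cluster) that the underlying images use the raw driver; image IDs are taken from the original report, so adjust them to your own. The point is simply that the disk block's driver matches the image driver, so the provider can match the disk when it reads the VM back:

resource "opennebula_virtual_machine" "vm" {
  name        = "testvm"
  cpu         = 1
  vcpu        = 1
  memory      = 768
  group       = "oneadmin"
  permissions = "660"

  disk {
    image_id = 17       # image whose driver is "raw"
    size     = 4096
    target   = "vda"
    driver   = "raw"    # match the image driver instead of "qcow2"
  }

  disk {
    image_id = 25       # image whose driver is "raw"
    size     = 3072
    target   = "vdb"
    driver   = "raw"
  }

  on_disk_change = "SWAP"
}

In our case, with the drivers aligned, subsequent applies no longer try to re-attach the second disk.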

treywelsh commented 1 year ago

Thanks for the details, it helps. I'm playing with the driver values, and it seems that I'm able to reproduce, in some cases, the problem you describe.

To be sure I'm clear on what's happening: both disks are attached on the cloud side if you look in Sunstone after the VM creation step (first step). Then, after creating the VM, the provider fetches the whole VM configuration from OpenNebula to read it. The disk and NIC reading code is trickier than for other attributes, and there is a problem during the read step: the provider recognizes only one of the two disks (it compares the TF description to the cloud-side VM information by matching attribute values), so it reads only one disk description from the cloud side. The consequence is that the provider believes only one disk is attached, which is why it tries to attach a disk again. But the disk is already attached, so we get conflict errors like `Target vdb is already in use`.

I'm not sure it's only a provider bug (this should be discussed), if we consider that OpenNebula receives a disk driver value from the provider but applies another value instead, without returning an error or giving the provider a hint about what happened. Currently the provider is not able to understand why OpenNebula didn't apply the disk with the provided attributes; it just believes it's another disk that it doesn't know about.

We could try to make the provider more tolerant by relaxing the attribute comparison a bit, if that breaks nothing else, or just accept that the current reading code doesn't work properly and rewrite it.

Personally, I won't refactor the disk/NIC code parts without deeper changes in the provider, or unless we consider a full rewrite. I described some ideas in this comment.

treywelsh commented 1 year ago

I can give you more details on how disk/NIC management currently works in the provider if needed. Feel free to share your thoughts or contribute if you think you have a better idea; any help/input is appreciated.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity, and it does not have the 'status: confirmed' label and is not in a milestone. Remove the 'status: stale' label or comment, or this will be closed in 5 days.