Telmate / terraform-provider-proxmox

Terraform provider plugin for proxmox
MIT License
2.15k stars 517 forks source link

v3.0.1-rc2: Stuck on still creating #1015

Closed toybas closed 4 months ago

toybas commented 4 months ago

First of all I would like to thank you for such a great job in creating this provider for Proxmox, it rocks!

With that said I'm currently have one issue stopping me for using it fully. I'm trying to deploy some VMs with v3.0.1-rc2 but the the provider hangs in "Still creating...".

The VM is created successfully in Proxmox but the terraform script hangs for about 9m10s with the message "Still creating...", before it eventually stops with this error: Error: cannot find any IP address

I am not using cloud-init and the os_type is set to "Linux 6.x - 2.6 Kernel". Agent is activated with agent = 1.

After some debugging of the provider it seems that the error stems from the VM not receiving an ipv6 address (which is to be expected since ipv6 is not configured).

I have tried to set skip_ipv6=true but this doesn't help (the value is ignored). Is this setting only available for cloud-init configurations? If I change hasRequiredIP to the below in heper_qemu.go it works as expected.

func (conn connectionInfo) hasRequiredIP() bool {
    if conn.IPs.IPv4 != "" || conn.IPs.IPv6 != "" {
        return true
    }   
    return true
    // if conn.IPs.IPv4 == "" && !conn.SkipIPv4 || conn.IPs.IPv6 == "" && !conn.SkipIPv6 {
    //  return false
    // }
    // return true
}

I'm not so well versed with all the details in this provider but why must both ipv4 and ipv6 be mandatory. Is it not enough if one of them is set?

Tinyblargon commented 4 months ago

@toybas On your second attempt did you recreate the VM? Once the guest agent is functioning we retry for 60 seconds to get an IP. If the agent didn't respond with an IP within the given time frame it will give the error cannot find any IP address https://github.com/Telmate/terraform-provider-proxmox/blob/be2585614955f7ada0433c9a1fe34e604bb717cd/proxmox/resource_vm_qemu.go#L1884-L1886

Notes:

toybas commented 4 months ago

I have tried to destroy the VM and re-create it, but I have also tried running the terraform script again after it has been created with a previous run.

If I re-create the VM (and manually drop it and then re-run the script) I get same error as I described in the issue. If the VM is already created by the provider (and the qemu-guest-agent is running) I get the following when doing a second run:

proxmox_vm_qemu.kubernetes[0]: Refreshing state... [id=/qemu/xxx]

It hangs in this state and nothing happens.

When doing the same with the custom code according to the example I gave in the issue, it works, and I get this on the second run:

proxmox_vm_qemu.kubernetes[0]: Refreshing state... [id=/qemu/xxx]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

With the changes I made it I can also destroy VMs with terraform (e.g. terraform apply -destroy)

proxmox_vm_qemu.kubernetes[0]: Destroying... [id=[id=/qemu/xxx] proxmox_vm_qemu.kubernetes[0]: Still destroying... [id=/qemu/xxx, 10s elapsed] proxmox_vm_qemu.kubernetes[0]: Still destroying... [id=/qemu/xxx, 20s elapsed] proxmox_vm_qemu.kubernetes[0]: Still destroying... [id=/qemu/xxx, 30s elapsed] proxmox_vm_qemu.kubernetes[0]: Still destroying... [id=/qemu/xxx, 40s elapsed] proxmox_vm_qemu.kubernetes[0]: Still destroying... [id=/qemu/xxx, 50s elapsed] proxmox_vm_qemu.kubernetes[0]: Still destroying... [id=/qemu/xxx, 1m0s elapsed] proxmox_vm_qemu.kubernetes[0]: Still destroying... [id=/qemu/xxx, 1m10s elapsed] proxmox_vm_qemu.kubernetes[0]: Still destroying... [id=/qemu/xxx, 1m20s elapsed] proxmox_vm_qemu.kubernetes[0]: Still destroying... [id=/qemu/xxx, 1m30s elapsed] proxmox_vm_qemu.kubernetes[0]: Destruction complete after 1m35s

So in summary I believe the terraform provider can communicate with the qemu-guest-agent but it only receives an ipv4 address and this is why its not working.

skip_ipv6=true could be a way out, but at the moment this value is ignored in my configuration (when debugging this value is false even though its set to true in the configuration).

Tinyblargon commented 4 months ago

I now understand whats happening here. Before we update the VM we read the current config of the VM, during this read step the error occurs. Because the error occurs during the read step, non of the new settings have been applied. Therefore you cannot influence how this step behaves.

Being unable to obtain an IP should give a warning instead of an error. This was changed everywhere except for: https://github.com/Telmate/terraform-provider-proxmox/blob/be2585614955f7ada0433c9a1fe34e604bb717cd/proxmox/resource_vm_qemu.go#L1886

could you try with #1016

mwiget commented 4 months ago

I also ran into the issue of Error: cannot find any IP address using v3.01-rc2 when deploying more than one VM using terraform count. The VM's however got IPv4 and IPv6 just fine and they also get reported via Proxmox UI. No issues seen when running with count=1.

Plan: 12 to add, 0 to change, 0 to destroy.
module.appstack.time_offset.exp_time: Creating...
module.appstack.time_offset.exp_time: Creation complete after 0s [id=2024-05-26T13:34:40Z]
volterra_token.site: Creating...
module.appstack.volterra_k8s_cluster.appstack: Creating...
volterra_token.site: Creation complete after 0s [id=d06cdd5f-2699-4282-bb2e-3622b0fc92cb]
module.appstack.proxmox_cloud_init_disk.asm-ci[1]: Creating...
module.appstack.proxmox_cloud_init_disk.asm-ci[0]: Creating...
module.appstack.proxmox_cloud_init_disk.asm-ci[0]: Creation complete after 0s [id=cephfs:iso/tf-ci-mw-appstack-1-m0.iso]
module.appstack.proxmox_cloud_init_disk.asm-ci[1]: Creation complete after 0s [id=cephfs:iso/tf-ci-mw-appstack-1-m1.iso]
module.appstack.proxmox_vm_qemu.asm-vm[1]: Creating...
module.appstack.proxmox_vm_qemu.asm-vm[0]: Creating...
module.appstack.volterra_k8s_cluster.appstack: Creation complete after 1s [id=0fda4834-0996-4e7c-8e95-34da8caa97fa]
module.appstack.proxmox_vm_qemu.asm-vm[1]: Still creating... [10s elapsed]
module.appstack.proxmox_vm_qemu.asm-vm[0]: Still creating... [10s elapsed]
module.appstack.proxmox_vm_qemu.asm-vm[0]: Still creating... [20s elapsed]
module.appstack.proxmox_vm_qemu.asm-vm[1]: Still creating... [20s elapsed]
module.appstack.proxmox_vm_qemu.asm-vm[0]: Still creating... [30s elapsed]
module.appstack.proxmox_vm_qemu.asm-vm[1]: Still creating... [30s elapsed]
module.appstack.proxmox_vm_qemu.asm-vm[1]: Still creating... [40s elapsed]
module.appstack.proxmox_vm_qemu.asm-vm[0]: Still creating... [40s elapsed]
module.appstack.proxmox_vm_qemu.asm-vm[0]: Still creating... [50s elapsed]
module.appstack.proxmox_vm_qemu.asm-vm[1]: Still creating... [50s elapsed]
module.appstack.proxmox_vm_qemu.asm-vm[0]: Still creating... [1m0s elapsed]
╷
│ Error: cannot find any IP address
│
│   with module.appstack.proxmox_vm_qemu.asm-vm[0],
│   on proxmox/appstack_master_vm.tf line 1, in resource "proxmox_vm_qemu" "asm-vm":
│    1: resource "proxmox_vm_qemu" "asm-vm" {
│
╵
╷
│ Error: cannot find any IP address
│
│   with module.appstack.proxmox_vm_qemu.asm-vm[1],
│   on proxmox/appstack_master_vm.tf line 1, in resource "proxmox_vm_qemu" "asm-vm":
│    1: resource "proxmox_vm_qemu" "asm-vm" {
│

Long story short, when I build and use https://github.com/Telmate/terraform-provider-proxmox/pull/1016, it works:

module.appstack[0].time_offset.exp_time: Creating...
module.appstack[0].time_offset.exp_time: Creation complete after 0s [id=2024-05-26T14:15:57Z]
module.appstack[0].proxmox_cloud_init_disk.asm-ci[1]: Creating...
module.appstack[0].volterra_k8s_cluster.appstack: Creating...
module.appstack[0].proxmox_cloud_init_disk.asm-ci[2]: Creating...
module.appstack[0].proxmox_cloud_init_disk.asm-ci[0]: Creating...
module.appstack[0].volterra_k8s_cluster.appstack: Creation complete after 1s [id=df02092c-dbd4-45b8-bccc-d0ff6886fae4]
module.appstack[0].proxmox_cloud_init_disk.asm-ci[2]: Creation complete after 2s [id=cephfs:iso/tf-ci-mw-appstack-1-m2.iso]
module.appstack[0].proxmox_cloud_init_disk.asm-ci[0]: Creation complete after 2s [id=cephfs:iso/tf-ci-mw-appstack-1-m0.iso]
module.appstack[0].proxmox_cloud_init_disk.asm-ci[1]: Creation complete after 2s [id=cephfs:iso/tf-ci-mw-appstack-1-m1.iso]
module.appstack[0].proxmox_vm_qemu.asm-vm[2]: Creating...
module.appstack[0].proxmox_vm_qemu.asm-vm[1]: Creating...
module.appstack[0].proxmox_vm_qemu.asm-vm[0]: Creating...
module.appstack[0].proxmox_vm_qemu.asm-vm[1]: Creation complete after 10s [id=prox1/qemu/101]
module.appstack[0].proxmox_vm_qemu.asm-vm[0]: Creation complete after 10s [id=prox1/qemu/100]
module.appstack[0].proxmox_vm_qemu.asm-vm[2]: Still creating... [10s elapsed]
module.appstack[0].proxmox_vm_qemu.asm-vm[2]: Still creating... [20s elapsed]
module.appstack[0].proxmox_vm_qemu.asm-vm[2]: Still creating... [30s elapsed]
module.appstack[0].proxmox_vm_qemu.asm-vm[2]: Still creating... [40s elapsed]
module.appstack[0].proxmox_vm_qemu.asm-vm[2]: Creation complete after 44s [id=prox1/qemu/102]
module.appstack[0].volterra_voltstack_site.appstack: Creating...
module.appstack[0].volterra_voltstack_site.appstack: Creation complete after 1s [id=97264239-f830-469a-b44e-a5db6975207b]
module.appstack[0].module.site_wait_for_online.null_resource.check_site_status_token[0]: Creating...
module.appstack[0].volterra_registration_approval.master[1]: Creating...
module.appstack[0].volterra_registration_approval.master[0]: Creating...
module.appstack[0].module.site_wait_for_online.null_resource.check_site_status_token[0]: Provisioning with 'local-exec'...
. . . 
PatrikKoptik commented 4 months ago

I am encountering the same issue. I would like to see the PR in the new Release Candidate! Thanks for the work.

toybas commented 4 months ago

1016 works somewhat better. Essentially the timeout for how long the wait for an IP has shortened, but it still reports "no static IP address is was found".

When doing terraform apply the first time

... ... ... proxmox_vm_qemu.kubernetes[0]: Creation complete after 5m2s [id=xxx/qemu/xxx] ╷ │ Warning: Qemu Guest Agent is enabled but no IP config is found │ │ with proxmox_vm_qemu.kubernetes[0], │ on main.tf line 28, in resource "proxmox_vm_qemu" "kubernetes": │ 28: resource "proxmox_vm_qemu" "kubernetes" { │ │ Qemu Guest Agent is enabled in your configuration but no static IP address is was found before the time ran out, increasing 'agent_timeout' could resolve this issue ╵

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

When doing terraform apply a second time:

proxmox_vm_qemu.kubernetes[0]: Refreshing state... [id=xxx/qemu/xxx]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed. ╷ │ Warning: Qemu Guest Agent is enabled but no IP config is found │ │ with proxmox_vm_qemu.kubernetes[0], │ on main.tf line 28, in resource "proxmox_vm_qemu" "kubernetes": │ 28: resource "proxmox_vm_qemu" "kubernetes" { │ │ Qemu Guest Agent is enabled in your configuration but no static IP address is was found before the time ran out, increasing 'agent_timeout' could resolve this issue ╵

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

As mentioned before, the ipv4 address is found, which I can confirm while debugging. Even though _skipipv6=true is configured the value of skip_ipv6 always remains false while debugging.

Isn't the purpose of the setting _skipipv6 = true to skip waiting for a ipv6 address and settle only with ipv4?

Tinyblargon commented 4 months ago

@toybas

Isn't the purpose of the setting _skipipv6 = true to skip waiting for a ipv6 address and settle only with ipv4?

That's exactly what it's for yes.

Tinyblargon commented 4 months ago

@toybas I've made some more changes to #1016. There where 2 bugs that would cause skip_ipv4 & skip_ipv6 to be reverted back to false.

mwiget commented 4 months ago

Tried with https://github.com/Telmate/terraform-provider-proxmox/pull/1016 and while it works (thanks to reporting Warning instead of error), it still can't seem to find the IP's:

module.securemesh-single-nic[0].local_file.kubeconfig: Creation complete after 0s [id=6cd3d70e4fc7a16a5e4f70aee7ef577a4bce08bc]
╷
│ Warning: Qemu Guest Agent is enabled but no IPv6 address is found
│ 
│   with module.securemesh-single-nic[0].proxmox_vm_qemu.master-vm[0],
│   on proxmox/master_vm.tf line 1, in resource "proxmox_vm_qemu" "master-vm":
│    1: resource "proxmox_vm_qemu" "master-vm" {
│ 
│ Qemu Guest Agent is enabled in your configuration but no IPv6 address was found before the time ran out, increasing 'agent_timeout' could resolve this issue. To suppress this warning set
│ 'skip_ipv6' to true.
╵

Apply complete! Resources: 8 added, 0 changed, 0 destroyed.

despite Proxmox reporting IPv4 and IPv6 correctly:

IMG_1282

Tinyblargon commented 4 months ago

@mwiget, it gives different errors depending if it couldn't find ipv4, ipv6 , ipv4 & ipv6. It has found an ipv4 address, but no ipv6 all the ipv6 addresses in your screenshot are local-link addresses.

mwiget commented 4 months ago

@Tinyblargon vhost0 has a public IPv6: IMG_1282

Tinyblargon commented 4 months ago

@mwiget is the returned ipv4 address of the vhost0 interface? maybe it's trying to get an ip from docker0 or host-envoy0? Also what happens if you increase agent_timeout to 120 2 minutes?

mwiget commented 4 months ago

@Tinyblargon setting pm_timeout=120 didn't change anything. Same complaint about ipv6. Re IP, I see the ipv4 of vhost0 in tfstate file, which matches the IPv4 and macaddr from the UI. Interesting that it reports an invalid ipv6 (same as ipv4) in terraform:

$ grep address terraform.tfstate
            "default_ipv4_address": "192.168.42.19",
            "default_ipv6_address": "192.168.42.19",
            "address": null,

IMG_1284 IMG_1285

Tinyblargon commented 4 months ago

@mwiget could you set TF_LOG to DEBUG and give me the log message that contains interfaces found:? It should have all the found agent interfaces with all their properties after it.

export TF_LOG=DEBUG

Also fixed the bug where default_ipv4_address and default_ipv6_address have the same value. https://github.com/Telmate/terraform-provider-proxmox/pull/1016/commits/a3ccb5fbac878b7f6ee54164df6dfe4349e5dd25

mwiget commented 4 months ago

@Tinyblargon sure. I did apply https://github.com/Telmate/terraform-provider-proxmox/commit/a3ccb5fbac878b7f6ee54164df6dfe4349e5dd25 and redeployed with DEBUG:

tf-debug.log.gz

$ grep address terraform.tfstate
            "default_ipv4_address": "192.168.42.41",
            "default_ipv6_address": "",
mwiget commented 4 months ago

The log reports rpc error: code = Unavailable desc = error reading from server: EOF

Based on the debug log I fixed 2 errors in my TF (iso path and disk size) and rerun with debug set, but still see that rpc error, which explains why func getPrimaryIP() didn't retry. proxmox2.log.gz

  2024-05-30T07:26:52.228+0200 [INFO]  provider.terraform-provider-proxmox: 2024/05/30 07:26:52 [DEBUG] VM is running, checking the IP: timestamp=2024-05-30T07:26:52.228+0200          
  2024-05-30T07:26:52.228+0200 [INFO]  provider.terraform-provider-proxmox: 2024/05/30 07:26:52 [INFO][initConnInfo] trying to get vm ip address for provisioner: timestamp=2024-05-30T0  7:26:52.228+0200                                                                                                                                                                      
  2024-05-30T07:26:52.228+0200 [INFO]  provider.terraform-provider-proxmox: 2024/05/30 07:26:52 [DEBUG][initConnInfo] retrying for at most  20m0s minutes before giving up: timestamp=20  24-05-30T07:26:52.228+0200                                                                                                                                                            
  2024-05-30T07:26:52.228+0200 [INFO]  provider.terraform-provider-proxmox: 2024/05/30 07:26:52 [DEBUG][initConnInfo] retries will end at 2024-05-30 07:46:52.228541 +0200 CEST m=+1243.  576900334: timestamp=2024-05-30T07:26:52.228+0200                                                                                                                                     
  2024-05-30T07:27:13.666+0200 [INFO]  provider.terraform-provider-proxmox: 2024/05/30 07:27:13 [DEBUG][initConnInfo] this is the vm configuration: 192.168.42.54 22: timestamp=2024-05-  30T07:27:13.665+0200                                                                                                                                                                  
  2024-05-30T07:27:13.682+0200 [WARN]  Provider "provider[\"registry.terraform.io/telmate/proxmox\"]" produced an unexpected new value for module.securemesh-single-nic[0].proxmox_vm_q  mu.master-vm[0], but we are tolerating it because it is using the legacy plugin SDK.                                                                                                   
      The following problems may be the cause of any confusing errors from downstream operations:                                                                                       
        - .cipassword: inconsistent values for sensitive attribute                                                                                                                      
        - .startup: was null, but now cty.StringVal("")                                                                                                                                 
        - .ipconfig0: was null, but now cty.StringVal("")                                                                                                                               
        - .ipconfig15: was null, but now cty.StringVal("")                                                                                                                              
        - .ipconfig9: was null, but now cty.StringVal("")                                                                                                                               
        - .ipconfig13: was null, but now cty.StringVal("")                                                                                                                              
        - .ipconfig11: was null, but now cty.StringVal("")                                                                                                                              
        - .numa: was null, but now cty.False                                                                                                                                            
        - .ciuser: was null, but now cty.StringVal("")                                                                                                                                  
        - .args: was null, but now cty.StringVal("")                                                                                                                                    
        - .ipconfig6: was null, but now cty.StringVal("")                                                                                                                               
        - .machine: was null, but now cty.StringVal("")                                                                                                                                 
        - .ipconfig1: was null, but now cty.StringVal("")                                                                                                                               
        - .ipconfig7: was null, but now cty.StringVal("")                                                                                                                               
        - .sshkeys: was null, but now cty.StringVal("")                                                                                                                                 
        - .ipconfig12: was null, but now cty.StringVal("")                                                                                                                              
        - .ipconfig8: was null, but now cty.StringVal("")                                                                                                                               
        - .hagroup: was null, but now cty.StringVal("")                                                                                                                                 
        - .ipconfig4: was null, but now cty.StringVal("")                                                                                                                               
        - .ipconfig14: was null, but now cty.StringVal("")                                                                                                                              
        - .cicustom: was null, but now cty.StringVal("")                                                                                                                                
        - .hastate: was null, but now cty.StringVal("")                                                                                                                                 
        - .ipconfig10: was null, but now cty.StringVal("")                                                                                                                              
        - .ipconfig2: was null, but now cty.StringVal("")                                                                                                                               
        - .qemu_os: was null, but now cty.StringVal("other")                                                                                                                            
        - .ipconfig3: was null, but now cty.StringVal("")                                                                                                                               
        - .ipconfig5: was null, but now cty.StringVal("")                                                                                                                               
        - .desc: was null, but now cty.StringVal("")                                                                                                                                    
        - .network[0].mtu: was null, but now cty.NumberIntVal(0)                                                                                                                        
        - .disks[0].scsi[0].scsi0[0].disk[0].asyncio: was null, but now cty.StringVal("")                                                                                               
        - .disks[0].scsi[0].scsi0[0].disk[0].serial: was null, but now cty.StringVal("")                                                                                                
        - .disks[0].scsi[0].scsi0[0].disk[0].readonly: was null, but now cty.False                                                                                                      
        - .disks[0].scsi[0].scsi0[0].disk[0].wwn: was null, but now cty.StringVal("")                                                                                                   
        - .disks[0].scsi[0].scsi0[0].disk[0].cache: was null, but now cty.StringVal("")                                                                                                 
        - .disks[0].scsi[0].scsi0[0].disk[0].iothread: was null, but now cty.False                                                                                                      
        - .disks[0].scsi[0].scsi0[0].disk[0].discard: was null, but now cty.False                                                                                                       
        - .disks[0].scsi[0].scsi0[0].disk[0].emulatessd: was null, but now cty.False                                                                                                    
        - .disks[0].scsi[0].scsi0[0].disk[0].replicate: was null, but now cty.False                                                                                                     
        - .disks[0].ide[0].ide2[0].cdrom[0].passthrough: was null, but now cty.False                                                                                                    
  2024-05-30T07:27:13.696+0200 [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"                      
  2024-05-30T07:27:13.697+0200 [DEBUG] provider: plugin process exited: path=/Users/mwiget/git/terraform-provider-proxmox/terraform-provider-proxmox pid=34013                          
  2024-05-30T07:27:13.703+0200 [DEBUG] provider: plugin exited                                                                                                                 
Tinyblargon commented 4 months ago

@mwiget I've found out the project uses 2 ways of debug logging of which one seems to be broken. I've added new logging so hopefully we get a result this time.

mwiget commented 4 months ago

@Tinyblargon using your latest https://github.com/Tinyblargon/terraform-provider-proxmox/tree/%231015 actually worked!

$ grep address terraform.tfstate
            "default_ipv4_address": "192.168.42.252",
            "default_ipv6_address": "2a02:168:5f5c:0:6c6c:68ff:fec6:90cb",

tf-proxmox.log.gz