cloudbase / cloudbase-init

Cross-platform instance initialization
http://openstack.org
Apache License 2.0

Exit 1001 from a userdata script makes the machine stuck in a reboot loop #80

Open rgl opened 2 years ago

rgl commented 2 years ago

I'm trying to install the Windows Containers feature and reboot the machine from userdata using this terraform snippet:

# a multipart cloudbase-init cloud-config.
# see https://github.com/cloudbase/cloudbase-init
# see https://cloudbase-init.readthedocs.io/en/latest/userdata.html#userdata
# see https://www.terraform.io/docs/providers/template/d/cloudinit_config.html
# see https://www.terraform.io/docs/configuration/expressions.html#string-literals
data "template_cloudinit_config" "example" {
  count = var.vm_count
  part {
    content_type = "text/cloud-config"
    content = <<-EOF
      #cloud-config
      hostname: ${var.vm_hostname_prefix}${count.index}
      timezone: Asia/Tbilisi
      EOF
  }
  part {
    filename = "install-windows-feature-containers.ps1"
    content_type = "text/x-shellscript"
    content = <<-EOF
      #ps1_sysnative
      Install-WindowsFeature Containers
      Exit 1001 # signal cloudbase-init to reboot.
      EOF
  }
}

According to https://cloudbase-init.readthedocs.io/en/latest/tutorial.html#file-execution, it seems I can exit with code 1001 to have cloudbase-init reboot the machine and never execute that part again. It almost worked: Install-WindowsFeature Containers did install the feature and the machine rebooted, but it then got stuck in a reboot loop.

I'm using cloudbase-init 1.1.2 on a Windows Server 2019 machine.

ader1990 commented 2 years ago

Hello,

As a best practice, userdata scripts need to be idempotent, which your Containers install script is not. There has to be a check like this: if (isContainerFeatureInstalled) { do nothing } else { install; exit 1001; }

On the actual issue, can you share the logs? It seems that you are using a metadata service that does not save state.

Thank you, Adrian Vladu
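
The check suggested above could be sketched in PowerShell roughly as follows. This is a minimal sketch, not a tested script: it assumes the ServerManager cmdlets (Get-WindowsFeature, Install-WindowsFeature) are available on the target Windows Server image, and it relies on the meaning of exit code 1001 as described in this thread (reboot and do not run this part again).

```powershell
#ps1_sysnative
# Sketch only: install the feature only when it is not already installed,
# so that re-running the script after a reboot does nothing.
$feature = Get-WindowsFeature -Name Containers
if (-not $feature.Installed) {
    $result = Install-WindowsFeature -Name Containers
    # Only ask for a reboot when the install reports that one is required.
    if ($result.RestartNeeded -eq 'Yes') {
        exit 1001 # signal cloudbase-init to reboot.
    }
}
exit 0
```

With this guard in place, a second run finds the feature installed and exits with 0 instead of requesting another reboot, even if cloudbase-init re-executes the script.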

rgl commented 2 years ago

Thank you! That was it! I've changed the code to:

      $result = Install-WindowsFeature Containers
      if ($result.RestartNeeded -eq 'Yes') {
        Exit 1001 # signal cloudbase-init to reboot.
      }

Here are the logs with the idempotent script:

cloudbase-init.log

If that's not enough to identify the metadata service in use, I'll gladly re-run the test.

ader1990 commented 2 years ago

Hi,

According to the logs provided:

DEBUG cloudbaseinit.init [-] Instance id: None configure_host

Without an instance ID, cloudbase-init cannot save the plugin state, so it runs every plugin at every reboot.

rgl commented 2 years ago

Thank you for that good catch!

I'm now setting the instance-id metadata property, but for some reason I can no longer log in to the machine. It somehow seems to have broken the part that handles the admin-username/admin-password metadata settings.

This is how I currently have the terraform snippet:

# see https://registry.terraform.io/providers/hashicorp/random/latest/docs/resources/uuid
resource "random_uuid" "example" {
  count = var.vm_count
}

data "template_cloudinit_config" "example" {
  count = var.vm_count
...
}

resource "vsphere_virtual_machine" "example" {
  count = var.vm_count
  annotation = "instance-id: ${random_uuid.example[count.index].result}"
  # NB this extra_config data ends-up inside the VM .vmx file and will be
  #    exposed by cloudbase-init as a cloud-init datasource.
  extra_config = {
    "guestinfo.metadata" = base64gzip(jsonencode({
      # TODO why does using instance-id seem to break the "admin-username/admin-password"
      #      handling, as I can no longer log in to the machine?
      "instance-id": random_uuid.example[count.index].result,
      "admin-username": var.winrm_username,
      "admin-password": var.winrm_password,
      "public-keys-data": trimspace(file("~/.ssh/id_rsa.pub")),
    })),
    "guestinfo.metadata.encoding" = "gzip+base64",
    "guestinfo.userdata" = data.template_cloudinit_config.example[count.index].rendered,
    "guestinfo.userdata.encoding" = "gzip+base64"
  }
...
}

The full source-code is temporarily at https://github.com/rgl/terraform-vsphere-windows-example/blob/wip/main.tf#L276-L290.

By looking at the logs, nothing seems to have failed, so I'm lost.

Can you please have a look at the following logs?

cloudbase-init.log

Maybe it has to do with the following?

2021-09-07 18:59:21.604 1780 WARNING cloudbaseinit.plugins.common.setuserpassword [-] Using admin_pass metadata user password. Consider changing it as soon as possible
2021-09-07 18:59:21.651 1780 INFO cloudbaseinit.plugins.common.setuserpassword [-] Password succesfully updated for user Administrator
2021-09-07 18:59:21.651 1780 INFO cloudbaseinit.plugins.common.setuserpassword [-] Cannot set the password in the metadata as it is not supported by this service

Though I'm not sure why it uses the username Administrator, as the username configured in the admin-username metadata property is vagrant.