canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

Rebooting VM during packer build causes cloud-init to run instance modules again #4359

Open shaneholder opened 1 year ago

shaneholder commented 1 year ago

Bug report

I am attempting to use Packer to create a golden image that mounts /var as a partition. In the Packer build, data is copied from several folders to the mounted /alt_* folders, and then a sed command is run against fstab to strip the /alt_ prefix so the entries point at the real mount points. The VM is then rebooted before continuing with the rest of the golden image configuration; however, after the reboot the changes made to fstab are replaced by cloud-init. Log files are attached.

Packer build section

build {
  sources = ["source.azure-arm.linux-image"]

  provisioner "file" {
    source = "${local.scripts_folder}/100_golden.cfg"
    destination = "/tmp/"
  }

  provisioner "shell" {
    inline = ["sudo mv /tmp/100_golden.cfg /etc/cloud/cloud.cfg.d/"]
  }

  # copy var, home and tmp to the /alt_* mounts
  provisioner "shell" {
    inline = [
      "sudo sh -c 'cd /var/; tar cf - . | (cd /alt_var; tar xf -)'",
      "sudo sh -c 'cd /home/; tar cf - . | (cd /alt_home; tar xf -)'",
      "sudo sh -c 'cd /tmp/; tar cf - . | (cd /alt_tmp; tar xf -)'"      
    ]
  }

  # rewrite fstab so the /alt_ mounts become the real mount points
  provisioner "shell" {
    inline = [
      "sudo sed -i 's#/alt_#/#' /etc/fstab",
    ]
  }

  # reboot vm to mount /var
  provisioner "shell" {
    inline = [
      "sudo reboot"
    ]
    expect_disconnect = true
    pause_after = "60s"
  }
}

Cloud Config

#cloud-config

disk_setup:
  /dev/disk/azure/scsi1/lun0:
    table_type: gpt
    layout: 
      - 15
      - 14
      - 28
      - 14
      - 14
      - 14
    overwrite: True

fs_setup:
  - device: /dev/disk/azure/scsi1/lun0
    partition: 1
    filesystem: ext4
  - device: /dev/disk/azure/scsi1/lun0
    partition: 2
    filesystem: ext4
  - device: /dev/disk/azure/scsi1/lun0
    partition: 3
    filesystem: ext4
  - device: /dev/disk/azure/scsi1/lun0
    partition: 4
    filesystem: ext4
  - device: /dev/disk/azure/scsi1/lun0
    partition: 5
    filesystem: ext4
  - device: /dev/disk/azure/scsi1/lun0
    partition: 6
    filesystem: ext4

mounts:
 - [ "/dev/disk/azure/scsi1/lun0-part1", "/alt_home", "auto", "defaults,nofail", "0", "2" ]
 - [ "/dev/disk/azure/scsi1/lun0-part2", "/alt_tmp", "auto", "defaults,nofail", "0", "2" ]
 - [ "/dev/disk/azure/scsi1/lun0-part3", "/alt_var", "auto", "defaults,nofail", "0", "2" ]
 - [ "/dev/disk/azure/scsi1/lun0-part4", "/alt_var/log", "auto", "defaults,nofail,x-mount.mkdir", "0", "2" ]
 - [ "/dev/disk/azure/scsi1/lun0-part5", "/alt_var/log/audit", "auto", "defaults,nofail,x-mount.mkdir", "0", "2" ]
 - [ "/dev/disk/azure/scsi1/lun0-part6", "/alt_var/tmp", "auto", "defaults,nofail,x-mount.mkdir", "0", "2" ]

Steps to reproduce the problem

Environment details

cloud-init logs

cloud-init.tar.gz

shaneholder commented 1 year ago

Adding the fstab in question.

# CLOUD_IMG: This file was created/modified by the Cloud Image build process
UUID=10657b0e-bcdb-4859-aeaf-31f0fb6dbde6       /        ext4   defaults,discard        0 1
UUID=CD18-78B8  /boot/efi       vfat    umask=0077      0 1
/dev/disk/azure/scsi1/lun0-part1        /alt_home       auto    defaults,nofail,comment=cloudconfig     0       2
/dev/disk/azure/scsi1/lun0-part2        /alt_tmp        auto    defaults,nofail,comment=cloudconfig     0       2
/dev/disk/azure/scsi1/lun0-part3        /alt_var        auto    defaults,nofail,comment=cloudconfig     0       2
/dev/disk/azure/scsi1/lun0-part4        /alt_var/log    auto    defaults,nofail,x-mount.mkdir,comment=cloudconfig       0       2
/dev/disk/azure/scsi1/lun0-part5        /alt_var/log/audit      auto    defaults,nofail,x-mount.mkdir,comment=cloudconfig       0       2
/dev/disk/azure/scsi1/lun0-part6        /alt_var/tmp    auto    defaults,nofail,x-mount.mkdir,comment=cloudconfig       0       2
/dev/disk/cloud/azure_resource-part1    /mnt    auto    defaults,nofail,x-systemd.requires=cloud-init.service,_netdev,comment=cloudconfig       0       2

And the 100_golden.cfg referenced in the packer instructions. This is only expected to be used when a VM is created using this golden image.

#cloud-config

mounts:
 - [ "/dev/disk/azure/scsi1/lun0-part1", "/home", "auto", "defaults,nofail", "0", "2" ]
 - [ "/dev/disk/azure/scsi1/lun0-part2", "/tmp", "auto", "defaults,nofail", "0", "2" ]
 - [ "/dev/disk/azure/scsi1/lun0-part3", "/var", "auto", "defaults,nofail", "0", "2" ]
 - [ "/dev/disk/azure/scsi1/lun0-part4", "/var/log", "auto", "defaults,nofail", "0", "2" ]
 - [ "/dev/disk/azure/scsi1/lun0-part5", "/var/log/audit", "auto", "defaults,nofail", "0", "2" ]
 - [ "/dev/disk/azure/scsi1/lun0-part6", "/var/tmp", "auto", "defaults,nofail", "0", "2" ]
igalic commented 1 year ago

you could touch /etc/cloud/cloud-init.disabled before packer reboot, and remove it afterwards
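A minimal sketch of that idea, expressed as the shell steps a packer build could run (the placement around the reboot is an assumption, not a confirmed fix):

# before issuing the reboot provisioner:
sudo touch /etc/cloud/cloud-init.disabled
# ... reboot and wait for the connection to come back ...
# in a later provisioner, once the post-reboot work is done:
sudo rm /etc/cloud/cloud-init.disabled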

shaneholder commented 1 year ago

you could touch /etc/cloud/cloud-init.disabled before packer reboot, and remove it afterwards

Thanks for the idea, but cloud-init still ran after the reboot... seriously odd.

aciba90 commented 1 year ago

Thanks for the idea, but cloud-init still ran after the reboot... seriously odd.

That's weird. Are we sure the file is being created before rebooting? Could you please attach the logs of your run touching cloud-init.disabled?

I believe the reason cloud-init fully runs after the reboot is that the packer config mounts an alternative /var partition, losing the original /var/lib/cloud/, which contains the cloud-init state from the first run. If we want cloud-init to run only the modules that run on every boot, we would have to somehow transfer /alt_var/lib/cloud/ to /var/lib/cloud/.
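A minimal sketch of that transfer, assuming it runs as a packer shell provisioner after cloud-init status --wait and before the fstab rewrite and reboot (untested, just to illustrate the idea):

# preserve the first-boot cloud-init state on the partition that will become /var,
# so the rebooted system is not treated as a fresh instance
sudo mkdir -p /alt_var/lib
sudo cp -a /var/lib/cloud /alt_var/lib/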

Another option would be to have two config files: 100_golden.cfg as-is, plus a supplemental 101_final.cfg containing the mounts config without the alt_ prefix; then remove 100_golden.cfg before the reboot using packer or runcmd. Please note that without restoring /var/lib/cloud, cloud-init would treat this as a first boot.

In any case, I believe this is not a cloud-init issue.

shaneholder commented 1 year ago

@aciba90 thanks for the reply. I assume you're referring to my /var manipulation in packer. I tried several configurations, even one where I don't manipulate /var (thinking along the same lines as you), but I still get the same behavior. While working this weekend I also discovered that I needed to wait for cloud-init to complete using cloud-init status --wait. Even with that, it seems that cloud-init is re-running. I've attached new logs and updated configuration files here.

Note the line in the cloud-init.log 2023-08-21 13:38:55,813 - cc_mounts.py[DEBUG]: Changes to fstab: ['- /dev/disk/azure/scsi1/lun0-part1 /home auto defaults,nofail,comment=cloudconfig 0 2', '+ /dev/disk/azure/scsi1/lun0-part1 /alt_home auto defaults,nofail,comment=cloudconfig 0 2']

cloud-init.tar.gz

Packer config


locals {
  scripts_folder = "${path.cwd}/Linux/scripts"
  compliance_folder = "${path.cwd}/compliance/ubuntu2004"
}

source "azure-arm" "linux-image" {
  azure_tags = {
    OS = var.OS
    Version = var.OS_version
  }
  use_azure_cli_auth  = true
  image_offer         = var.image_offer
  image_publisher     = var.image_publisher
  image_sku           = var.image_sku
  shared_image_gallery_destination {
    subscription          = var.subscription_id
    resource_group        = var.sig_rg_name
    gallery_name          = var.gallery_name
    image_name            = var.image_name
    image_version         = var.image_version
    replication_regions   = var.replication_regions
    storage_account_type  = var.storage_account_type
  } 
  build_resource_group_name               = var.packer_rg_name
  virtual_network_name                    = var.vnet_name
  virtual_network_subnet_name             = var.subnet_name
  virtual_network_resource_group_name     = var.vnet_rg_name
  private_virtual_network_with_public_ip  = true
  managed_image_name                      = "${var.image_name}-${var.image_version}"
  managed_image_resource_group_name       = var.image_rg_name
  os_type                                 = var.os_type
  vm_size                                 = var.vm_size
  disk_additional_size                    = [35] #/home(5) /tmp(5) /var(10) /var/log(5) /var/log/audit(5) /var/tmp(5)
  custom_data_file                        = "${local.scripts_folder}/cloud.cfg"
  ssh_private_key_file                    = "~/.ssh/id_rsa"
}

build {
  sources = ["source.azure-arm.linux-image"]

  # Wait for cloud init to complete
  provisioner "shell" {
    execute_command = "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'"
    inline = [
      "cloud-init status --wait"
    ]
  }

  # copy home to /alt_home
  provisioner "shell" {
    remote_folder = "/home/packer"
    execute_command = "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'"
    inline = [       
      "cd /home/; tar --selinux --acls --xattrs -cf - . | (cd /alt_home; tar xf -)"
    ]
  }

  # rewrite fstab so the /alt_ mounts become the real mount points
  provisioner "shell" {
    remote_folder = "/home/packer"
    inline = [
      "sudo sed -i 's#/alt_#/#' /etc/fstab",
    ]
  }

  provisioner "breakpoint" {
    disable = false
    note    = "reboot next"
  }

  # reboot vm to mount /var
  provisioner "shell" {
    remote_folder = "/home/packer"
    inline = [
      "sudo reboot"
    ]
    expect_disconnect = true
    pause_after = "60s"
  }

  provisioner "breakpoint" {
    disable = false
    note    = "post reboot"
  }
}

cloud-init config

#cloud-config

disk_setup:
  /dev/disk/azure/scsi1/lun0:
    table_type: gpt
    layout: 
      - 15
      - 14
      - 28
      - 14
      - 14
      - 14
    overwrite: True

fs_setup:
  - device: /dev/disk/azure/scsi1/lun0
    partition: 1
    filesystem: ext4
  - device: /dev/disk/azure/scsi1/lun0
    partition: 2
    filesystem: ext4
  - device: /dev/disk/azure/scsi1/lun0
    partition: 3
    filesystem: ext4
  - device: /dev/disk/azure/scsi1/lun0
    partition: 4
    filesystem: ext4
  - device: /dev/disk/azure/scsi1/lun0
    partition: 5
    filesystem: ext4
  - device: /dev/disk/azure/scsi1/lun0
    partition: 6
    filesystem: ext4

mounts:
 - [ "/dev/disk/azure/scsi1/lun0-part1", "/alt_home", "auto", "defaults,nofail", "0", "2" ]

package_update: true

packages:
 - vim
 - less
aciba90 commented 1 year ago

In your packer script, could you please add "touch /etc/cloud/cloud-init.disabled"[1] after "cloud-init status --wait"?

That will disable cloud-init from running on the second boot, and you will see a line with etc_file found disabled in /run/cloud-init/cloud-init-generator.log. (That line is not present in your logs.)

Doing that, the created image will never run cloud-init again until the disable file is removed, at which point your problem will reappear.

If one wants images where cloud-init runs, I think the user-data for the final instance must be modified before reboot.

[1] https://canonical-cloud-init.readthedocs-hosted.com/en/latest/howto/disable_cloud_init.html#method-1-text-file
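A short sketch of that sequence plus a post-reboot check against the generator log mentioned above (exact log wording may differ):

cloud-init status --wait
sudo touch /etc/cloud/cloud-init.disabled
# ... reboot ...
# afterwards, confirm the generator saw the disable file
grep -i disabled /run/cloud-init/cloud-init-generator.log
cloud-init status --long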

shaneholder commented 1 year ago

@aciba90 thanks, I will give that a try again. FWIW this is on Ubuntu, and I thought I saw code that only used that file for RHEL, but I could be mistaken. While I'm trying that, any thoughts as to why cloud-init is running after the reboot in the first place?

shaneholder commented 1 year ago

Hi, same problem: cloud-init re-runs on the reboot. I am not expecting the cloud-init execution at 19:52.

root@pkrvmt1dezfw0mb:/var/log# grep lun0-part1 cloud-init.log 
2023-08-28 19:50:54,523 - cc_mounts.py[DEBUG]: mounts configuration is [['/dev/disk/azure/scsi1/lun0-part1', '/alt_home', 'auto', 'defaults,nofail', '0', '2']]
2023-08-28 19:50:54,524 - cc_mounts.py[DEBUG]: Attempting to determine the real name of /dev/disk/azure/scsi1/lun0-part1
2023-08-28 19:50:54,524 - cc_mounts.py[DEBUG]: changed /dev/disk/azure/scsi1/lun0-part1 => None
2023-08-28 19:50:54,524 - cc_mounts.py[DEBUG]: Ignoring nonexistent named mount /dev/disk/azure/scsi1/lun0-part1
2023-08-28 19:50:54,526 - cc_mounts.py[DEBUG]: Changes to fstab: ['+ /dev/disk/azure/scsi1/lun0-part1 /alt_home auto defaults,nofail,comment=cloudconfig 0 2', '+ /dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail,x-systemd.requires=cloud-init.service,_netdev,comment=cloudconfig 0 2']
2023-08-28 19:52:43,840 - cc_mounts.py[DEBUG]: mounts configuration is [['/dev/disk/azure/scsi1/lun0-part1', '/alt_home', 'auto', 'defaults,nofail', '0', '2']]
2023-08-28 19:52:43,841 - cc_mounts.py[DEBUG]: Attempting to determine the real name of /dev/disk/azure/scsi1/lun0-part1
2023-08-28 19:52:43,841 - cc_mounts.py[DEBUG]: changed /dev/disk/azure/scsi1/lun0-part1 => None
2023-08-28 19:52:43,841 - cc_mounts.py[DEBUG]: Ignoring nonexistent named mount /dev/disk/azure/scsi1/lun0-part1
2023-08-28 19:52:43,842 - cc_mounts.py[DEBUG]: Changes to fstab: ['- /dev/disk/azure/scsi1/lun0-part1 /home auto defaults,nofail,comment=cloudconfig 0 2', '+ /dev/disk/azure/scsi1/lun0-part1 /alt_home auto defaults,nofail,comment=cloudconfig 0 2']

cloud-init.tar.gz

holmanb commented 1 year ago

@aciba90 thanks, I will give that a try again. FWIW this is on Ubuntu, and I thought I saw code that only used that file for RHEL, but I could be mistaken.

The existence of this file disables cloud-init in all of the systemd service files; it is not just a RHEL thing.

While I'm trying that, any thoughts as to why cloud-init is running after the reboot in the first place?

Something is starting cloud-init? I see that the init and init-local stages are re-run, but I'm not sure why. If systemd isn't observing this file in your image, then you have a systemd bug. If something else is starting cloud-init externally, you'll need to figure out what that is. Either way, I agree with @aciba90's analysis: I don't see a cloud-init bug here.

Aside: You are providing invalid cloud-config to cloud-init, and there are other warnings in your logs, so cloud-init is likely not doing what you want it to. For example, mount -a fails due to unknown filesystem type ntfs.
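One way to catch such problems up front is cloud-init's schema validator; this is a suggestion on top of the comment above, assuming a cloud-init release new enough to ship the schema subcommand (the 23.x in these logs is):

# validate a candidate config before baking it into the image
cloud-init schema --config-file 100_golden.cfg --annotate
# or validate whatever user-data the running instance actually received
sudo cloud-init schema --system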

shaneholder commented 1 year ago

Ok, I'll look into that. It's odd because we're using the Canonical minimal image in Azure.

image_publisher = "Canonical"
image_offer     = "0001-com-ubuntu-minimal-focal"
image_sku       = "minimal-20_04-lts-gen2"

And the file you pointed out is where I got the idea that cloud-init.disabled was for RHEL only. Line 29 is a conditional on variant=="rhel", so I assumed that was Red Hat.

raharper commented 1 year ago

Ok, I'll look into that. It's odd because we're using the Canonical minimal image in Azure.

image_publisher = "Canonical"
image_offer     = "0001-com-ubuntu-minimal-focal"
image_sku       = "minimal-20_04-lts-gen2"

And the file you pointed out is where I got the idea that cloud-init.disabled was for RHEL only. Line 29 is a conditional on variant=="rhel", so I assumed that was Red Hat.

Indeed it is; the current version of the template only emits that for 'rhel' distros. That's OK. What actually controls whether cloud-init runs is the systemd generator[1] along with the ds-identify[2] tool, which consults the cloud-init policy[3] to determine what to do if any cloud-init datasources are found.

  1. https://github.com/canonical/cloud-init/blob/main/systemd/cloud-init-generator.tmpl
  2. https://github.com/canonical/cloud-init/blob/main/tools/ds-identify
  3. https://github.com/canonical/cloud-init/blob/main/tools/ds-identify#L12

The log file /run/cloud-init/ds-identify.log from your debug attachment shows that the policy inside the Ubuntu image you used was this:

policy loaded: mode=search report=false found=all maybe=all notfound=disabled
/etc/cloud/cloud.cfg.d/90_dpkg.cfg set datasource_list: [ Azure ]
...
ON_FOUND=all
ON_MAYBE=all
ON_NOTFOUND=disabled
...
single entry in datasource_list (Azure) use that.

In particular, mode=search,found=all in the policy means to search for datasources and run any that are found. In this case it found the Azure datasource, so by policy it enables cloud-init, which means cloud-init.target is wanted by the default system target and cloud-init will always run, independent of what's in /etc/cloud-init/*.

To resolve your issue of disabling cloud-init on Azure, you should be able to write out a disabled policy file like so:

cat > /etc/cloud/ds-identify.cfg <<EOF
policy: disabled,found=first,maybe=none,notfound=disabled
EOF

This will prevent cloud-init.target from being added to systemd's default target, and none of the cloud-init service units will be run.

I don't have an azure instance right now, but testing on an LXD system that booted via cloud-init, adding that file results in the following:

% lxc exec j1 /bin/bash
root@j1:~# lsb_release -rd
Description:    Ubuntu 22.04.1 LTS
Release:    22.04
root@j1:~# systemctl status cloud-init.target
○ cloud-init.target - Cloud-init target
     Loaded: loaded (/lib/systemd/system/cloud-init.target; static)
     Active: inactive (dead)
root@j1:~# cat /run/cloud-init/ds-identify.log
[up 0.25s] ds-identify
policy loaded: mode=disabled report=false found=first maybe=none notfound=disabled
/etc/cloud/cloud.cfg.d/90_dpkg.cfg set datasource_list: [ NoCloud, ConfigDrive, OpenNebula, DigitalOcean, Azure, AltCloud, OVF, MAAS, GCE, OpenStack, CloudSigma, SmartOS, Bigstep, Scaleway, AliYun, Ec2, CloudStack, Hetzner, IBMCloud, Oracle, Exoscale, RbxCloud, UpCloud, VMware, Vultr, LXD, NWCS, None ]
DMI_PRODUCT_NAME=System Product Name
DMI_SYS_VENDOR=ASUS
DMI_PRODUCT_SERIAL=unavailable
DMI_PRODUCT_UUID=unavailable
PID_1_PRODUCT_NAME=unavailable
DMI_CHASSIS_ASSET_TAG=Default string
DMI_BOARD_NAME=PRIME TRX40-PRO
FS_LABELS=unavailable:container
ISO9660_DEVS=unavailable:container
KERNEL_CMDLINE=/sbin/init
VIRT=lxc
UNAME_KERNEL_NAME=Linux
UNAME_KERNEL_RELEASE=5.15.0-78-generic
UNAME_KERNEL_VERSION=#85~20.04.1-Ubuntu SMP Mon Jul 17 09:42:39 UTC 2023
UNAME_MACHINE=x86_64
UNAME_NODENAME=j1
UNAME_OPERATING_SYSTEM=GNU/Linux
DSNAME=
DSLIST=NoCloud ConfigDrive OpenNebula DigitalOcean Azure AltCloud OVF MAAS GCE OpenStack CloudSigma SmartOS Bigstep Scaleway AliYun Ec2 CloudStack Hetzner IBMCloud Oracle Exoscale RbxCloud UpCloud VMware Vultr LXD NWCS None
MODE=disabled
ON_FOUND=first
ON_MAYBE=none
ON_NOTFOUND=disabled
pid=45 ppid=24
is_container=true
mode=disabled. returning 1
[up 0.28s] returning 1
root@j1:~# cloud-init status --long
status: disabled
boot_status_code: disabled-by-generator
detail:
Cloud-init disabled by cloud-init-generator

You can simply remove this policy file to re-enable cloud-init in that image.
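If the goal is a golden image that still runs cloud-init for end users, a plausible (unverified) last packer step would be to drop the policy file again before capturing the image:

# re-enable cloud-init for instances created from the captured image
sudo rm /etc/cloud/ds-identify.cfg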

shaneholder commented 1 year ago

I'm starting to think cloud-init just doesn't like me. :) Still not the results I'm expecting.

ds-identify.cfg file

packer@pkrvm81fdkczhiy:/var/log$ cat /etc/cloud/ds-identify.cfg
policy: disabled,found=first,maybe=none,notfound=disabled

check cloud-init status

packer@pkrvm81fdkczhiy:/var/log$ cloud-init status --long
status: running
boot_status_code: disabled-by-generator
last_update: Wed, 30 Aug 2023 21:22:33 +0000
detail:
DataSourceAzure [seed=/var/lib/waagent]

grep for cloud-init modifying fstab; the change at 21:22:33,556 is unexpected

packer@pkrvm81fdkczhiy:/var/log$ grep alt_home cloud-init.log 
2023-08-30 21:20:47,722 - cc_mounts.py[DEBUG]: mounts configuration is [['/dev/disk/azure/scsi1/lun0-part1', '/alt_home', 'auto', 'defaults,nofail', '0', '2']]
2023-08-30 21:20:47,725 - cc_mounts.py[DEBUG]: Changes to fstab: ['+ /dev/disk/azure/scsi1/lun0-part1 /alt_home auto defaults,nofail,comment=cloudconfig 0 2', '+ /dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail,x-systemd.requires=cloud-init.service,_netdev,comment=cloudconfig 0 2']
2023-08-30 21:22:33,554 - cc_mounts.py[DEBUG]: mounts configuration is [['/dev/disk/azure/scsi1/lun0-part1', '/alt_home', 'auto', 'defaults,nofail', '0', '2']]
2023-08-30 21:22:33,556 - cc_mounts.py[DEBUG]: Changes to fstab: ['- /dev/disk/azure/scsi1/lun0-part1 /home auto defaults,nofail,comment=cloudconfig 0 2', '+ /dev/disk/azure/scsi1/lun0-part1 /alt_home auto defaults,nofail,comment=cloudconfig 0 2']

verify that fstab was modified back to /alt_home

packer@pkrvm81fdkczhiy:/var/log$ grep home /etc/fstab
/dev/disk/azure/scsi1/lun0-part1        /alt_home       auto    defaults,nofail,comment=cloudconfig     0       2
packer@pkrvm81fdkczhiy:/var/log$ 

An additional oddity I noticed is that cloud-init status --wait never seems to exit after the reboot.

tail the log file

2023-08-30 21:22:33,914 - util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
2023-08-30 21:22:33,915 - util.py[DEBUG]: Read 10 bytes from /proc/uptime
2023-08-30 21:22:33,915 - util.py[DEBUG]: cloud-init mode 'init' took 1.303 seconds (1.30)
2023-08-30 21:22:33,915 - handlers.py[DEBUG]: finish: init-network: SUCCESS: searching for network datasources
2023-08-30 21:22:33,915 - handlers.py[DEBUG]: HyperVReportingHandler flushing remaining events

print current date time

packer@pkrvm81fdkczhiy:/var/log$ date
Wed Aug 30 21:33:18 UTC 2023

wait some more

packer@pkrvm81fdkczhiy:/var/log$ cloud-init status --wait
.........................................................................................................................................................................................................................................................................................................^CFailed to write to /dev/console
Cloud-init 23.2.2-0ubuntu0~20.04.1 received SIGINT, exiting...
  Filename: /usr/lib/python3/dist-packages/cloudinit/cmd/status.py
  Function: handle_status_args
  Line number: 127
    Filename: /usr/lib/python3/dist-packages/cloudinit/util.py
    Function: log_time
    Line number: 2745
      Filename: /usr/lib/python3/dist-packages/cloudinit/cmd/main.py
      Function: main
      Line number: 1080

tail the log again

packer@pkrvm81fdkczhiy:/var/log$ tail -f cloud-init.log 
2023-08-30 21:22:33,913 - handlers.py[DEBUG]: start: init-network/config-ssh: running config-ssh with frequency once-per-instance
2023-08-30 21:22:33,913 - helpers.py[DEBUG]: config-ssh already ran (freq=once-per-instance)
2023-08-30 21:22:33,913 - handlers.py[DEBUG]: finish: init-network/config-ssh: SUCCESS: config-ssh previously ran
2023-08-30 21:22:33,913 - main.py[DEBUG]: Ran 10 modules with 0 failures
2023-08-30 21:22:33,914 - atomic_helper.py[DEBUG]: Atomically writing to file /var/lib/cloud/data/status.json (via temporary file /var/lib/cloud/data/tmp4z0dj5lo) - w: [644] 540 bytes/chars
2023-08-30 21:22:33,914 - util.py[DEBUG]: Reading from /proc/uptime (quiet=False)
2023-08-30 21:22:33,915 - util.py[DEBUG]: Read 10 bytes from /proc/uptime
2023-08-30 21:22:33,915 - util.py[DEBUG]: cloud-init mode 'init' took 1.303 seconds (1.30)
2023-08-30 21:22:33,915 - handlers.py[DEBUG]: finish: init-network: SUCCESS: searching for network datasources
2023-08-30 21:22:33,915 - handlers.py[DEBUG]: HyperVReportingHandler flushing remaining events

still no update to the log file

packer@pkrvm81fdkczhiy:/var/log$ date
Wed Aug 30 21:34:52 UTC 2023
raharper commented 1 year ago

Thanks for testing.

$ cloud-init status --long
status: running
boot_status_code: disabled-by-generator
last_update: Wed, 30 Aug 2023 21:22:33 +0000

This shows that cloud-init was successfully disabled and was not run by systemd. Whatever else is going on, cloud-init is not misbehaving here.

If cloud-init is running but was disabled by the generator, then something else besides systemd started the service. I believe someone else pointed out that some other tool may be invoking cloud-init commands directly.

If you want to find out what is doing that, you can move /usr/bin/cloud-init out of the way (rename it); whatever script or service is invoking cloud-init should then break or stack-trace, and you can find the origin of your pain.

A complete dump of journalctl -b 0 -o short-precise will be helpful to packer support in diagnosing what is invoking cloud-init directly.
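A compact sketch of those two steps (the rename target name is arbitrary):

# before the reboot: move the binary so any direct invocation fails loudly
sudo mv /usr/bin/cloud-init /usr/bin/cloud-init.moved
# after the reboot: capture the full boot journal for the packer folks
journalctl -b 0 -o short-precise > /tmp/journal-boot0.txt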

shaneholder commented 1 year ago

Thanks for the troubleshooting tips. I've done that and before I submit anything to the packer team I wanted a quick review of the below.

I'm using the same code as before, but I added a sudo mv /usr/bin/cloud-init /usr/bin/cloud-not-init right before I reboot the VM. The line @ 442136: is that in opposition to the statement above that "something else besides systemd started the service"?

Trying to make sure I'm understanding the expectations of the process.

packer@pkrvmgvzux857cz:~$ journalctl -b 0 -o short-precise | grep -i cloud-init
Sep 05 19:36:11.442095 pkrvmgvzux857cz systemd[452]: cloud-init-local.service: Failed to execute command: No such file or directory
Sep 05 19:36:11.442136 pkrvmgvzux857cz systemd[452]: cloud-init-local.service: Failed at step EXEC spawning /usr/bin/cloud-init: No such file or directory
Sep 05 19:36:11.442607 pkrvmgvzux857cz systemd[1]: Starting Initial cloud-init job (pre-networking)...
Sep 05 19:36:11.461169 pkrvmgvzux857cz systemd[1]: cloud-init-local.service: Main process exited, code=exited, status=203/EXEC
Sep 05 19:36:11.461321 pkrvmgvzux857cz systemd[1]: cloud-init-local.service: Failed with result 'exit-code'.
Sep 05 19:36:11.462046 pkrvmgvzux857cz systemd[1]: Failed to start Initial cloud-init job (pre-networking).
Sep 05 19:36:14.241921 pkrvmgvzux857cz systemd[1]: Starting Initial cloud-init job (metadata service crawler)...
Sep 05 19:36:14.245319 pkrvmgvzux857cz systemd[475]: cloud-init.service: Failed to execute command: No such file or directory
Sep 05 19:36:14.245363 pkrvmgvzux857cz systemd[475]: cloud-init.service: Failed at step EXEC spawning /usr/bin/cloud-init: No such file or directory
Sep 05 19:36:14.245828 pkrvmgvzux857cz systemd[1]: cloud-init.service: Main process exited, code=exited, status=203/EXEC
Sep 05 19:36:14.245959 pkrvmgvzux857cz systemd[1]: cloud-init.service: Failed with result 'exit-code'.
Sep 05 19:36:14.246326 pkrvmgvzux857cz systemd[1]: Failed to start Initial cloud-init job (metadata service crawler).
Sep 05 19:36:14.890478 pkrvmgvzux857cz python3[501]: 2023-09-05T19:36:14.889224Z INFO Daemon cloud-init is enabled: True
Sep 05 19:36:14.892794 pkrvmgvzux857cz python3[501]: 2023-09-05T19:36:14.891726Z INFO Daemon Using cloud-init for provisioning
raharper commented 1 year ago

Thanks for the troubleshooting tips. I've done that and before I submit anything to the packer team I wanted a quick review of the below.

I'm using the same code as before, but I added a sudo mv /usr/bin/cloud-init /usr/bin/cloud-not-init right before I reboot the VM. The line @ 442136: is that in opposition to the statement above that "something else besides systemd started the service"?

I've not used packer, so I'm not sure what's happening during a packer VM execution, but from what I understand there are at least these 3 things happening (please correct me where I'm wrong)

1) first boot of some cloud-image on azure (is this unmodified Ubuntu or has packer already made changes?)
2) (1) reads the cloud-init configuration from the metadata service in azure and runs that code
3) at some point during (2) a reboot is issued.

Before the reboot I would:

1) invoke cloud-init collect-logs to capture the current state of cloud-init, saving the tarball somewhere you can retrieve it after the reboot
2) inject the ds-identify policy to disable cloud-init
3) move /usr/bin/cloud-init out of the way
4) optionally enable verbose systemd debugging (adding "systemd.log_level=debug systemd.log_target=console systemd.journald.forward_to_console=1" to the kernel cmdline)

After reboot

1) invoke cloud-init collect-logs to capture the new state of the system.

Comparing the collected logs from before and after reboot should shed some light on what's going on.

  1. systemctl status <unit> on units cloud-init.target cloud-init-local.service cloud-init.service

This hopefully will show them as inactive, though it seems that in your case they've been added to some systemd target.

  2. systemd-analyze dot cloud-init.target to show what services/targets "want" cloud-init.target

  3. find /run/systemd/generator* -name 'cloud-init.target' to see if a generator enabled cloud-init (these checks are collected into one runnable sequence below)
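Collected into one runnable sequence (unit names and paths exactly as listed above):

# did the units actually run?
systemctl status cloud-init.target cloud-init-local.service cloud-init.service
# what wants cloud-init.target?
systemd-analyze dot cloud-init.target
# did a generator create the enabling symlink?
find /run/systemd/generator* -name 'cloud-init.target'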

Trying to make sure I'm understanding the expectations of the process.

The process is mostly about capturing enough information on the current state of the system to understand why systemd starts cloud-init even though cloud-init does not generate a link telling systemd to start it up.

The ds-identify activation from cloud-init results in creating a symlink here:

root@j1:/run/systemd/generator.early/multi-user.target.wants# ls -al
total 0
drwxr-xr-x 2 root root 60 Sep  5 21:47 .
drwxr-xr-x 3 root root 60 Sep  5 21:47 ..
lrwxrwxrwx 1 root root 37 Sep  5 21:47 cloud-init.target -> /lib/systemd/system/cloud-init.target

Lastly, if you cannot find out what is enabling cloud-init units, then the big hammer is to mask the units.

$ systemctl mask cloud-init.target cloud-init-local.service cloud-init.service cloud-config.service cloud-final.service

Ideally masking just the .target is enough, but masking all of them will ensure systemd will not execute them.

root@j1:~# systemctl mask cloud-init.target
Created symlink /etc/systemd/system/cloud-init.target → /dev/null.
root@j1:~# reboot
root@j1:~#
     ..:(╯°□°)╯彡┻━━┻
% lxc exec j1 /bin/bash
root@j1:~# cloud-init status --long
status: not run
boot_status_code: enabled-by-generator
detail:
Cloud-init enabled by systemd cloud-init-generator
root@j1:~# systemctl status cloud-init.target
○ cloud-init.target
     Loaded: masked (Reason: Unit cloud-init.target is masked.)
     Active: inactive (dead)

To undo, just run systemctl unmask on each unit you masked.