canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/
Other
2.97k stars 880 forks source link

cloud-config script not executing: cloud-final.service not reached #3946

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1960678

Launchpad details
affected_projects = []
assignee = None
assignee_name = None
date_closed = 2022-04-18T04:17:21.343310+00:00
date_created = 2022-02-11T21:17:27.388213+00:00
date_fix_committed = None
date_fix_released = None
id = 1960678
importance = undecided
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1960678
milestone = None
owner = colin.williams.seattle
owner_name = Colin Williams
private = False
status = expired
submitter = colin.williams.seattle
submitter_name = Colin Williams
tags = []
duplicates = []

Launchpad user Colin Williams(colin.williams.seattle) wrote on 2022-02-11T21:17:27.388213+00:00

From the mailing list archive: https://lists.launchpad.net/cloud-init/msg00385.html

Here's the debugging information based on Chads' response which may not have been delivered to the list. Note that I have to type out the details due to firewall and network details. It looks like everything is ok based on described. I do have to run the script as root for it to succeed. However the first command to write the dummy file doesn't require root.

To debug further where should I specifically look for a log that captures the run command script execution after Shellification?

Is there a way to capture any errors or even the script run itself?


  1. Yes

cloud-config

runcmd:

output : {all : '| tee -a /var/log/cloud-init-output.log'}

  1. Valid cloud-config: system userdata
  2. status: running time: Wed, 09 Feb 2022 21:46:00 +0000 detail: DataSourceEc2

    when I look at /var/lib/cloud-init.log I see relevant lines that tell me

    util.py start: modules-config/config-runcmd: running config--runcmd with frequency once-per-instance util.py Shellified 9 commands Writing to /var/lib/cloud/instances/.... main.py Ran 14 modules with 0 failures

    Don't see any errors

  3. when I manually try to inoke the script without sudo -bash : /var/lib/path/to/instance/run-cmd : Permission denied

I checked the ownership. The file is rwx owner, and owned by root root

When I execute the script at /var/lib/path/to/instance/run-cmd using sudo, it runs as expected.

On Wed, Feb 9, 2022 at 10:31 PM Colin Williams colin.williams.seattle@xxxxxxxxx wrote:

I made a look at /var/lib/cloud/instance/user-data.txt and it appears that the cloud config script is deployed.

On Wed, Feb 9, 2022 at 10:30 PM Colin Williams colin.williams.seattle@xxxxxxxxx wrote:

Hi,

I am trying to mount a instance block device (non EBS) similar to

https://stackoverflow.com/a/67723667

With the hostname parameter in cloud config I noticed an error ( I don't recall the exact location where I saw this)

I removed the hostname parameter. In the runcmd section I added a line to write a file to /tmp/cc to try to see if the script is executed. - echo "" > /tmp/cc

I am not seeing any sign of execution.

I made a look at /var/lib/cloud/instance/user-data.txt and it appears that the cloud config script is executed.

There appears to be a systemd cloud-init.service enabled. When I look at the logs it appears to be displaying a log generating ssh key pairs.


In order to debug this:

How do I first manually execute the user-data script using cloud-init to test functionality without testing terraform deployment?

How do I get better insight into the cloud-init service or why it's not executing the script?

For example, I don't see anything related to the user-data script from journalctl -u cloud-init

ubuntu-server-builder commented 1 year ago

Launchpad user Colin Williams(colin.williams.seattle) wrote on 2022-02-11T21:17:27.388213+00:00

Launchpad attachments: cloud-init.logs.tar.gz

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2022-02-14T16:24:08.220451+00:00

To answer your specific questions: "How do I first manually execute the user-data script using cloud-init to test functionality without testing terraform deployment?" To have cloud-init re-run a module, you can use 'cloud-init single'. For your scenario, that should look like 'sudo cloud-init single --name scripts-user --frequency always'

"How do I get better insight into the cloud-init service or why it's not executing the script?" /var/log/cloud-init-output.log should tell you if there are any failures.

"For example, I don't see anything related to the user-data script from journalctl -u cloud-init" It won't be journalctl -u cloud-init. It will be in /var/log/cloud-init-output.log

More generally: Runcmd really is a combination of two separate modules. The runcmd module will write a script to /var/lib/cloud/instance/scripts. If you see the script there, runcmd has done its job. The module to actually run the script there is the scripts-user module. If you see 'scripts-user' in /var/log/cloud-init.log with no errors, then it should have run the script.

You could add a 'set -x' to the top of the runcmd script, and it should write the extra output to /var/log/cloud-init-output.log as you'd expect from any other script with that set.

Also, see the second note on https://cloudinit.readthedocs.io/en/latest/topics/modules.html#runcmd . We suggest using "/var/tmp" rather than "/tmp".

If I run your script (which won't work for me) with an additional 'set -x', I see the following at the end of /var/log/cloud-init-output.log:

Cloud-init v. 21.4 running 'modules:config' at Mon, 14 Feb 2022 16:21:12 +0000. Up 8.52 seconds.

ubuntu-server-builder commented 1 year ago

Launchpad user Colin Williams(colin.williams.seattle) wrote on 2022-02-15T23:20:51.278030+00:00

Hello,

I had copied parts of my email from the mailing list description. I was also in the irc chat where I was instructed to collect logs.

Regarding your error, you are not using an amazon instance which has /dev/nvme2n1 as a device. Of course it would fail.

ubuntu-server-builder commented 1 year ago

Launchpad user Colin Williams(colin.williams.seattle) wrote on 2022-02-15T23:22:56.773368+00:00

I attached the debugging logs specifically related to the issue as instructed. It's not clear that you have taken a look at them.

Best Regards,

Colin Williams

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2022-02-16T15:06:37.715660+00:00

Hey Colin. The intent of my message was to answer your questions. I did indeed take a look at the logs. Why do you think that I did not?

I'm aware that your script will fail on my machine, which is why I said "If I run your script (which won't work for me)". You asked, "Is there a way to capture any errors or even the script run itself?", so I pasted it (knowing it would fail) to show an example of what it would look like to capture the errors along with the script run itself.

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2022-02-16T22:39:11.186484+00:00

Hey James, thanks for trying to point the bug in the direction of getting more details, hence the incomplete status on this bug.

Hey Colin,

There a few things strange here that don't quite add up to typical cloud-init behavior, so please forgive our inabilty to quickly get to the cause of the symptoms you are seeing.

Ultimately we need to find out why cloud-final.service is not enabled and active on your system because something has disabled or masked it from the systemd multi-user.target boot goal.

I think you mentioned in IRC that cloud-final.service is loaded and "dead" and that systemd-analyze critical-chain told you your boot hasn't finished yet.

"When I run systemd-analyze critical-chain it says Bootup is not yet finished. Please try again later"

If something is blocking boot from completing it'll also likely block cloud-final.service which can only start after the following services have been started. After=network-online.target cloud-config.service rc-local.service After=multi-user.target

I'd suggest checking what systemd service is holding up your system. I see a errors in your journal.txt that point to network connection trouble like motd accessing network info, so something with a broken network config could significantly delay network-online.target

Some things to try:

Because cloud-final.service has not yet been exected as part of the systemd boot target, none of the cloud config final modules are run which means the module config-scripts-user won't run and your runcmd isn't executed, which is why you don't see a log in cloud-init.log like the following:

subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd']

Looking through your cloud-init.log there are some inconsistencies I see unrelated to why cloud-final.service isn't running, but they represent underlying symptoms of problems within the AMI you are using. I put them here to give some guidance on debugging and triage steps.

  1. this log shows 4 different Ec2 instance ids (i-0b5d73191f895c644', i-0d10d51a5d9be8963, i-04aac2fce436160e8, wi-0add919ad0e973ab1. This makes me think this VM created from an unclean golden image or migrated between networks. Anytime cloud-init sees a new instance id from Ec2's instance metadata sevice it should re-run all configuration as it assumed instance metadata config has changed.

    • You can see this with all the "Update datasource metadata and network config due to events: boot-new-instance" logs in cloud-init.log

    • If this VM was launched or cloned from a golden image built on Jan 1st maybe. That golden image process should ultimately run sudo cloud-init clean --logs and likely rm /etc/machine-id before shutdown and image capture. If a cloned image has not run sudo cloud-init clean cloud-init's semaphores are left around to prevent re-running config modules unless it reacts to instance-id changes across reboot.

  2. There are 5 boot records seen from cloud-init analyze show only the first two of which look to trigger all cloud-init boot stages: init-local, init-network, modules-config, modules-final. The last three boot records don't show running cloud-final.service for some reason we have yet to determine, from your systemctl status output

    3.Your "Shellified 10 commands" log when actually writing out runcmd user-data didn't actually happen until the last boot of this image in the "modules:config" stage. So, userdata with runcmd wasn't actually present until the launch of instance-id i-0add919ad0e973ab1

2022-02-11 18:15:15,598 - util.py[DEBUG]: Creating symbolic link from '/var/lib/cloud/instance' => '/var/lib/cloud/instances/i-0add919ad0e973ab1' ... 2022-02-11 18:15:29,531 - util.py[DEBUG]: Shellified 10 commands.

On this most recent boot record, we also see the actual script being written out.

2022-02-11 18:15:29,532 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-0add919ad0e973ab1/scripts/runcmd - wb: [700] 333 bytes

But again, no cloud-final.service is active on this boot as seen from systemctl status.

I would have expected to see something like the following from systemctl status cloud-final.service .. Active: active (exited) since Mon 2022-02-14 23:01:33 UTC; 1 day 22h ago ...

I don't see any Breaking ordering cycle type logs in journal.txt so it doesn't look lik

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2022-02-16T22:39:55.222969+00:00

I don't see any Breaking ordering cycle type logs in journal.txt so it doesn't look like there are competing systemd dependency chains that caused cloud-final.service to get disabled.

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2022-02-16T22:44:13.751950+00:00

parting shot: given that cloud-final.service is set After=multi-user.target and previously you mentioned systemd-analyze told you the boot isn't complete. I'd guess you'd need to debug why multi-user.target isn't reached on that system, because that target may likely blocking cloud-final.service from being kicked off.

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2022-04-18T04:17:21.187212+00:00

[Expired for cloud-init because there has been no activity for 60 days.]