Closed ubuntu-server-builder closed 1 year ago
Launchpad user Colin Williams(colin.williams.seattle) wrote on 2022-02-11T21:17:27.388213+00:00
Launchpad attachments: cloud-init.logs.tar.gz
Launchpad user James Falcon(falcojr) wrote on 2022-02-14T16:24:08.220451+00:00
To answer your specific questions: "How do I first manually execute the user-data script using cloud-init to test functionality without testing terraform deployment?" To have cloud-init re-run a module, you can use 'cloud-init single'. For your scenario, that should look like 'sudo cloud-init single --name scripts-user --frequency always'
"How do I get better insight into the cloud-init service or why it's not executing the script?" /var/log/cloud-init-output.log should tell you if there are any failures.
"For example, I don't see anything related to the user-data script from journalctl -u cloud-init" It won't be journalctl -u cloud-init. It will be in /var/log/cloud-init-output.log
More generally: Runcmd really is a combination of two separate modules. The runcmd module will write a script to /var/lib/cloud/instance/scripts. If you see the script there, runcmd has done its job. The module to actually run the script there is the scripts-user module. If you see 'scripts-user' in /var/log/cloud-init.log with no errors, then it should have run the script.
You could add a 'set -x' to the top of the runcmd script, and it should write the extra output to /var/log/cloud-init-output.log as you'd expect from any other script with that set.
Also, see the second note on https://cloudinit.readthedocs.io/en/latest/topics/modules.html#runcmd . We suggest using "/var/tmp" rather than "/tmp".
If I run your script (which won't work for me) with an additional 'set -x', I see the following at the end of /var/log/cloud-init-output.log:
Cloud-init v. 21.4 running 'modules:config' at Mon, 14 Feb 2022 16:21:12 +0000. Up 8.52 seconds.
Launchpad user Colin Williams(colin.williams.seattle) wrote on 2022-02-15T23:20:51.278030+00:00
Hello,
I had copied parts of my email from the mailing list description. I was also in the irc chat where I was instructed to collect logs.
Regarding your error, you are not using an amazon instance which has /dev/nvme2n1 as a device. Of course it would fail.
Launchpad user Colin Williams(colin.williams.seattle) wrote on 2022-02-15T23:22:56.773368+00:00
I attached the debugging logs specifically related to the issue as instructed. It's not clear that you have taken a look at them.
Best Regards,
Colin Williams
Launchpad user James Falcon(falcojr) wrote on 2022-02-16T15:06:37.715660+00:00
Hey Colin. The intent of my message was to answer your questions. I did indeed take a look at the logs. Why do you think that I did not?
I'm aware that your script will fail on my machine, which is why I said "If I run your script (which won't work for me)". You asked, "Is there a way to capture any errors or even the script run itself?", so I pasted it (knowing it would fail) to show an example of what it would look like to capture the errors along with the script run itself.
Launchpad user Chad Smith(chad.smith) wrote on 2022-02-16T22:39:11.186484+00:00
Hey James, thanks for trying to point the bug in the direction of getting more details, hence the incomplete status on this bug.
Hey Colin,
There a few things strange here that don't quite add up to typical cloud-init behavior, so please forgive our inabilty to quickly get to the cause of the symptoms you are seeing.
Ultimately we need to find out why cloud-final.service is not enabled and active on your system because something has disabled or masked it from the systemd multi-user.target boot goal.
I think you mentioned in IRC that cloud-final.service is loaded and "dead" and that systemd-analyze critical-chain
told you your boot hasn't finished yet.
"When I run systemd-analyze critical-chain it says Bootup is not yet finished. Please try again later"
If something is blocking boot from completing it'll also likely block cloud-final.service which can only start after the following services have been started. After=network-online.target cloud-config.service rc-local.service After=multi-user.target
I'd suggest checking what systemd service is holding up your system. I see a errors in your journal.txt that point to network connection trouble like motd accessing network info, so something with a broken network config could significantly delay network-online.target
Some things to try:
Because cloud-final.service has not yet been exected as part of the systemd boot target, none of the cloud config final modules are run which means the module config-scripts-user won't run and your runcmd isn't executed, which is why you don't see a log in cloud-init.log like the following:
subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd']
Looking through your cloud-init.log there are some inconsistencies I see unrelated to why cloud-final.service isn't running, but they represent underlying symptoms of problems within the AMI you are using. I put them here to give some guidance on debugging and triage steps.
this log shows 4 different Ec2 instance ids (i-0b5d73191f895c644', i-0d10d51a5d9be8963, i-04aac2fce436160e8, wi-0add919ad0e973ab1. This makes me think this VM created from an unclean golden image or migrated between networks. Anytime cloud-init sees a new instance id from Ec2's instance metadata sevice it should re-run all configuration as it assumed instance metadata config has changed.
You can see this with all the "Update datasource metadata and network config due to events: boot-new-instance" logs in cloud-init.log
If this VM was launched or cloned from a golden image built on Jan 1st maybe. That golden image process should ultimately run sudo cloud-init clean --logs
and likely rm /etc/machine-id before shutdown and image capture. If a cloned image has not run sudo cloud-init clean
cloud-init's semaphores are left around to prevent re-running config modules unless it reacts to instance-id changes across reboot.
There are 5 boot records seen from cloud-init analyze show
only the first two of which look to trigger all cloud-init boot stages: init-local, init-network, modules-config, modules-final. The last three boot records don't show running cloud-final.service for some reason we have yet to determine, from your systemctl status output
3.Your "Shellified 10 commands" log when actually writing out runcmd user-data didn't actually happen until the last boot of this image in the "modules:config" stage. So, userdata with runcmd wasn't actually present until the launch of instance-id i-0add919ad0e973ab1
2022-02-11 18:15:15,598 - util.py[DEBUG]: Creating symbolic link from '/var/lib/cloud/instance' => '/var/lib/cloud/instances/i-0add919ad0e973ab1' ... 2022-02-11 18:15:29,531 - util.py[DEBUG]: Shellified 10 commands.
On this most recent boot record, we also see the actual script being written out.
2022-02-11 18:15:29,532 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/i-0add919ad0e973ab1/scripts/runcmd - wb: [700] 333 bytes
But again, no cloud-final.service is active on this boot as seen from systemctl status.
I would have expected to see something like the following from systemctl status cloud-final.service
..
Active: active (exited) since Mon 2022-02-14 23:01:33 UTC; 1 day 22h ago
...
I don't see any Breaking ordering cycle type logs in journal.txt so it doesn't look lik
Launchpad user Chad Smith(chad.smith) wrote on 2022-02-16T22:39:55.222969+00:00
I don't see any Breaking ordering cycle type logs in journal.txt so it doesn't look like there are competing systemd dependency chains that caused cloud-final.service to get disabled.
Launchpad user Chad Smith(chad.smith) wrote on 2022-02-16T22:44:13.751950+00:00
parting shot: given that cloud-final.service is set After=multi-user.target and previously you mentioned systemd-analyze told you the boot isn't complete. I'd guess you'd need to debug why multi-user.target isn't reached on that system, because that target may likely blocking cloud-final.service from being kicked off.
Launchpad user Launchpad Janitor(janitor) wrote on 2022-04-18T04:17:21.187212+00:00
[Expired for cloud-init because there has been no activity for 60 days.]
This bug was originally filed in Launchpad as LP: #1960678
Launchpad details
Launchpad user Colin Williams(colin.williams.seattle) wrote on 2022-02-11T21:17:27.388213+00:00
From the mailing list archive: https://lists.launchpad.net/cloud-init/msg00385.html
Here's the debugging information based on Chads' response which may not have been delivered to the list. Note that I have to type out the details due to firewall and network details. It looks like everything is ok based on described. I do have to run the script as root for it to succeed. However the first command to write the dummy file doesn't require root.
To debug further where should I specifically look for a log that captures the run command script execution after Shellification?
Is there a way to capture any errors or even the script run itself?
cloud-config
runcmd:
output : {all : '| tee -a /var/log/cloud-init-output.log'}
status: running time: Wed, 09 Feb 2022 21:46:00 +0000 detail: DataSourceEc2
when I look at /var/lib/cloud-init.log I see relevant lines that tell me
util.py start: modules-config/config-runcmd: running config--runcmd with frequency once-per-instance util.py Shellified 9 commands Writing to /var/lib/cloud/instances/.... main.py Ran 14 modules with 0 failures
Don't see any errors
I checked the ownership. The file is rwx owner, and owned by root root
When I execute the script at /var/lib/path/to/instance/run-cmd using sudo, it runs as expected.
On Wed, Feb 9, 2022 at 10:31 PM Colin Williams colin.williams.seattle@xxxxxxxxx wrote: