canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/
Other
2.87k stars 856 forks source link

After rebooting a machine, cloud-init hits error: Exec format error: b'/var/lib/cloud/instance/scripts/runcmd' #5699

Open maburlik opened 1 week ago

maburlik commented 1 week ago

Bug report

I have an infrastructure where I'm deploying VMs and running cloud-init using a NoCloud data source package. Frequently, there may be reboots during the early stages after the machine comes up, and the cloud-init configuration that may be sent by the user used for configuring the VM are out of the realm of my control. It frequently includes entries under 'runcmd'.

What I'm finding appears to be a race condition when cloud-init is abruptly interrupted by an out of band user-driven reboot request.

The reboot often comes in before the runcmd module has completed running. Here is a happy case summary as emitted in the cloud-init.log. The runcmd module is run after reboot: machine: vm_ssh_test_2_ubuntu

2024-09-12 19:02:16,950 - modules.py[DEBUG]: Running module runcmd (<module 'cloudinit.config.cc_runcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_runcmd.py'>) with frequency once-per-instance
2024-09-12 19:02:16,950 - handlers.py[DEBUG]: start: modules-config/config-runcmd: running config-runcmd with frequency once-per-instance
2024-09-12 19:02:16,951 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/custom-hostname-961713c7a7a53c3c/sem/config_runcmd - wb: [644] 24 bytes
2024-09-12 19:02:16,951 - helpers.py[DEBUG]: Running config-runcmd using lock (<FileLock using file '/var/lib/cloud/instances/custom-hostname-961713c7a7a53c3c/sem/config_runcmd'>)
2024-09-12 19:02:16,951 - util.py[DEBUG]: Shellified 2 commands.
2024-09-12 19:02:16,951 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/custom-hostname-961713c7a7a53c3c/scripts/runcmd - wb: [700] 129 bytes
2024-09-12 19:02:16,951 - handlers.py[DEBUG]: finish: modules-config/config-runcmd: SUCCESS: config-runcmd ran successfully
...
2024-09-12 19:02:17,827 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=False, capture=False)
2024-09-12 19:02:17,848 - handlers.py[DEBUG]: finish: modules-final/config-scripts_user: SUCCESS: config-scripts_user ran successfully

Now here is the reproduced failure. There are no mentions of the runcmd module running prior to reboot, yet there is a log saying config-runcmd already ran, even though this is a fresh Ubuntu VM image. machine: vm_ssh_test_1_ubuntu

2024-09-12 19:02:16,845 - modules.py[DEBUG]: Running module runcmd (<module 'cloudinit.config.cc_runcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_runcmd.py'>) with frequency once-per-instance
2024-09-12 19:02:16,845 - handlers.py[DEBUG]: start: modules-config/config-runcmd: running config-runcmd with frequency once-per-instance
2024-09-12 19:02:16,845 - helpers.py[DEBUG]: config-runcmd already ran (freq=once-per-instance)
2024-09-12 19:02:16,845 - handlers.py[DEBUG]: finish: modules-config/config-runcmd: SUCCESS: config-runcmd previously ran
...
2024-09-12 19:02:17,992 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=False, capture=False)
2024-09-12 19:02:18,010 - subp.py[DEBUG]: Exec format error. Missing #! in script?
Command: ['/var/lib/cloud/instance/scripts/runcmd']
Exit code: -
Reason: [Errno 8] Exec format error: b'/var/lib/cloud/instance/scripts/runcmd'
Stdout: -
Stderr: -
2024-09-12 19:02:18,010 - cc_scripts_user.py[WARNING]: Failed to run module scripts_user (scripts in /var/lib/cloud/instance/scripts)
2024-09-12 19:02:18,010 - handlers.py[DEBUG]: finish: modules-final/config-scripts_user: FAIL: running config-scripts_user with frequency once-per-instance
2024-09-12 19:02:18,010 - util.py[WARNING]: Running module scripts_user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
2024-09-12 19:02:18,011 - util.py[DEBUG]: Running module scripts_user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/config/modules.py", line 286, in _run_modules
    ran, _r = cc.run(
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 60, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 156, in run
    results = functor(**args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line 38, in handle
    subp.runparts(runparts_path)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 390, in runparts
    raise RuntimeError(
RuntimeError: Runparts: 1 failures (runcmd) in 1 attempted commands

My speculation is that there is no marker file written that would indicate whether the /var/lib/cloud/instance/scripts/runcmd has flushed to disk, which would explain that the file is missing a #! shebang.

Here is an example of this intermediate file written by cloud-init when all goes well:

# head -n5 /var/lib/cloud/instance/scripts/runcmd
#!/bin/sh
echo 'This is a VM configured with cloud-init!' > /var/tmp/maxb-cloud-init.txt
/opt/maxb-vm/first-boot-command.sh

As a side note, the /opt/maxb-vm/first-boot-command.sh is embedded in the image prior to boot, and deterministically always has the shebang, and image is flushed to ensure the write.

Steps to reproduce the problem

This is a race condition and happens rarely, but becomes a blocking issue when it occurs.

  1. Create a VM with user-data containing runcmd.
  2. Reboot the machine before cloud-init completes runcmd module.
  3. Upon the machine returning, observe trace saying the config-runcmd module has run, but the /var/lib/cloud/instance/scripts/runcmd appears to be invalid.

Environment details

cloud-init logs

cloud-init_redacted.log

maburlik commented 2 days ago

Hit this again and saved the file:

/var/lib/cloud/instance/scripts/runcmd

It was in fact empty.