canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

After rebooting a machine, cloud-init hits error: Exec format error: b'/var/lib/cloud/instance/scripts/runcmd' #5699

Open · maburlik opened this issue 2 months ago

maburlik commented 2 months ago

Bug report

I have an infrastructure where I'm deploying VMs and running cloud-init using a NoCloud data source. Reboots may frequently occur during the early stages after a machine comes up, and the cloud-init configuration the user sends to configure the VM is outside my control; it frequently includes entries under 'runcmd'.

What I'm finding appears to be a race condition when cloud-init is abruptly interrupted by an out-of-band, user-driven reboot request.

The reboot often comes in before the runcmd module has completed. Here is a happy-case summary as emitted in cloud-init.log, where the runcmd module runs after the reboot (machine: vm_ssh_test_2_ubuntu):

2024-09-12 19:02:16,950 - modules.py[DEBUG]: Running module runcmd (<module 'cloudinit.config.cc_runcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_runcmd.py'>) with frequency once-per-instance
2024-09-12 19:02:16,950 - handlers.py[DEBUG]: start: modules-config/config-runcmd: running config-runcmd with frequency once-per-instance
2024-09-12 19:02:16,951 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/custom-hostname-961713c7a7a53c3c/sem/config_runcmd - wb: [644] 24 bytes
2024-09-12 19:02:16,951 - helpers.py[DEBUG]: Running config-runcmd using lock (<FileLock using file '/var/lib/cloud/instances/custom-hostname-961713c7a7a53c3c/sem/config_runcmd'>)
2024-09-12 19:02:16,951 - util.py[DEBUG]: Shellified 2 commands.
2024-09-12 19:02:16,951 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/custom-hostname-961713c7a7a53c3c/scripts/runcmd - wb: [700] 129 bytes
2024-09-12 19:02:16,951 - handlers.py[DEBUG]: finish: modules-config/config-runcmd: SUCCESS: config-runcmd ran successfully
...
2024-09-12 19:02:17,827 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=False, capture=False)
2024-09-12 19:02:17,848 - handlers.py[DEBUG]: finish: modules-final/config-scripts_user: SUCCESS: config-scripts_user ran successfully

Now here is the reproduced failure. There is no mention of the runcmd module running prior to the reboot, yet the log says config-runcmd already ran, even though this is a fresh Ubuntu VM image (machine: vm_ssh_test_1_ubuntu):

2024-09-12 19:02:16,845 - modules.py[DEBUG]: Running module runcmd (<module 'cloudinit.config.cc_runcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_runcmd.py'>) with frequency once-per-instance
2024-09-12 19:02:16,845 - handlers.py[DEBUG]: start: modules-config/config-runcmd: running config-runcmd with frequency once-per-instance
2024-09-12 19:02:16,845 - helpers.py[DEBUG]: config-runcmd already ran (freq=once-per-instance)
2024-09-12 19:02:16,845 - handlers.py[DEBUG]: finish: modules-config/config-runcmd: SUCCESS: config-runcmd previously ran
...
2024-09-12 19:02:17,992 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=False, capture=False)
2024-09-12 19:02:18,010 - subp.py[DEBUG]: Exec format error. Missing #! in script?
Command: ['/var/lib/cloud/instance/scripts/runcmd']
Exit code: -
Reason: [Errno 8] Exec format error: b'/var/lib/cloud/instance/scripts/runcmd'
Stdout: -
Stderr: -
2024-09-12 19:02:18,010 - cc_scripts_user.py[WARNING]: Failed to run module scripts_user (scripts in /var/lib/cloud/instance/scripts)
2024-09-12 19:02:18,010 - handlers.py[DEBUG]: finish: modules-final/config-scripts_user: FAIL: running config-scripts_user with frequency once-per-instance
2024-09-12 19:02:18,010 - util.py[WARNING]: Running module scripts_user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
2024-09-12 19:02:18,011 - util.py[DEBUG]: Running module scripts_user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/config/modules.py", line 286, in _run_modules
    ran, _r = cc.run(
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 60, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 156, in run
    results = functor(**args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line 38, in handle
    subp.runparts(runparts_path)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 390, in runparts
    raise RuntimeError(
RuntimeError: Runparts: 1 failures (runcmd) in 1 attempted commands

My speculation is that no marker file is written to indicate whether /var/lib/cloud/instance/scripts/runcmd has been flushed to disk, which would explain why the file is missing its #! shebang.

Here is an example of this intermediate file written by cloud-init when all goes well:

# head -n5 /var/lib/cloud/instance/scripts/runcmd
#!/bin/sh
echo 'This is a VM configured with cloud-init!' > /var/tmp/maxb-cloud-init.txt
/opt/maxb-vm/first-boot-command.sh

As a side note, /opt/maxb-vm/first-boot-command.sh is embedded in the image prior to boot, deterministically always has its shebang, and the image is flushed to ensure the write persists.
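For reference, user-data along these lines would produce that intermediate file. This YAML is reconstructed from the two shellified commands above, so treat it as an assumption rather than the exact payload:

#cloud-config
runcmd:
  - echo 'This is a VM configured with cloud-init!' > /var/tmp/maxb-cloud-init.txt
  - /opt/maxb-vm/first-boot-command.sh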

Steps to reproduce the problem

This is a race condition and happens rarely, but becomes a blocking issue when it occurs.

  1. Create a VM with user-data containing runcmd.
  2. Reboot the machine before cloud-init completes the runcmd module.
  3. When the machine returns, observe a log trace saying config-runcmd has already run, while /var/lib/cloud/instance/scripts/runcmd is invalid.

Environment details

cloud-init logs

cloud-init_redacted.log

maburlik commented 1 month ago

Hit this again and saved the file:

/var/lib/cloud/instance/scripts/runcmd

It was in fact empty.

aciba90 commented 1 month ago

Per your logs (which are not complete; they don't contain the first execution of runcmd), we can see that cloud-init does not execute runcmd because it is marked as already executed:

2024-09-12 19:02:16,845 - modules.py[DEBUG]: Running module runcmd (<module 'cloudinit.config.cc_runcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_runcmd.py'>) with frequency once-per-instance
2024-09-12 19:02:16,845 - handlers.py[DEBUG]: start: modules-config/config-runcmd: running config-runcmd with frequency once-per-instance
2024-09-12 19:02:16,845 - helpers.py[DEBUG]: config-runcmd already ran (freq=once-per-instance)
2024-09-12 19:02:16,845 - handlers.py[DEBUG]: finish: modules-config/config-runcmd: SUCCESS: config-runcmd previously ran

Hit this again and saved the file:

/var/lib/cloud/instance/scripts/runcmd

It was in fact empty.

And if this is the case, then what very likely happened is that the system rebooted while the script file was being written, before it was flushed; see [1, 2].
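To illustrate the failure mode (a simplified sketch, not cloud-init's actual write path): a plain buffered write lands in the page cache first, so an abrupt reboot before writeback can leave a zero-length file at the destination.

# Illustrative only: data written this way sits in the page cache until
# the kernel writes it back; an abrupt reboot in that window can leave
# an empty or truncated file on disk.
script = b"#!/bin/sh\necho hello\n"
with open("/var/lib/cloud/instance/scripts/runcmd", "wb") as f:
    f.write(script)  # no flush()/os.fsync(), so durability is not guaranteed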

What I'm finding appears to be a race condition when cloud-init is abruptly interrupted by an out-of-band, user-driven reboot request.

Abruptly rebooting the system while cloud-init is executing is not a supported scenario and this is not considered a bug.

Could you detect such cases and re-run cloud-init (with cloud-init clean and a reboot)? Or run only runcmd via https://docs.cloud-init.io/en/latest/reference/cli.html#single?
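A minimal sketch of that detect-and-rerun idea, assuming you control a boot-time hook on the platform side. cloud-init status, cloud-init clean --reboot, and cloud-init single are real subcommands, but the error heuristic below is an assumption to adapt:

import subprocess

def cloud_init_errored() -> bool:
    # Treat a reported error status as a sign of an interrupted run.
    result = subprocess.run(
        ["cloud-init", "status", "--long"],
        capture_output=True, text=True, check=False,
    )
    return "error" in result.stdout

if cloud_init_errored():
    # Clears /var/lib/cloud state (including per-instance semaphores)
    # and reboots, so once-per-instance modules run again from scratch.
    subprocess.run(["cloud-init", "clean", "--reboot"], check=False)
    # Alternative: re-run just the one module, e.g.
    # cloud-init single --name cc_runcmd --frequency always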

That being said, what's your expectation from this issue? How do you think cloud-init should behave in such a case?

TheRealFalcon commented 1 month ago

Given that the reproduction steps include an unsupported step ("Reboot the machine before cloud-init completes the runcmd module"), I'm going to close this issue. Feel free to re-open the discussion if you think we've missed something.

maburlik commented 1 month ago


Thank you @aciba90 for your response. I'd like to get a bit of clarification on a few points.

Abruptly rebooting the system while cloud-init is executing is not a supported scenario and this is not considered a bug.

A user-driven reboot may happen during the execution of cloud-init, since cloud-init runs in the background (as administered by the platform) while the user performs their own operations. These are distinct managing actors, and we cannot prevent the user from running 'shutdown -r now'.

Commands under runcmd may run arbitrarily long, for example when downloading large packages or performing CPU-intensive operations, so we cannot gate the user's operations behind a "cloud-init has completed" hook.

That being said, what's your expectation from this issue? How do you think cloud-init should behave in such a case?

I would expect cloud-init's tracking of whether a module has finished running to be deterministically atomic. One way to achieve this would be to persist the flush of the script file and only then write a marker file; on re-entry, check the marker file rather than a content file that may or may not have been flushed. This is just one approach; maybe you have other ideas for the implementation detail.
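A minimal sketch of that write-then-mark ordering, using the paths from the logs above. This is an illustrative pattern, not cloud-init's actual code, and durable_write is a hypothetical helper:

import os

def durable_write(path: str, data: bytes, mode: int = 0o700) -> None:
    # Write to a temporary name, flush to disk, then atomically rename
    # into place, so the final path never holds a partial file.
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
    try:
        os.write(fd, data)
        os.fsync(fd)  # persist the file contents
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomic on POSIX filesystems
    dfd = os.open(os.path.dirname(path), os.O_DIRECTORY)
    try:
        os.fsync(dfd)  # persist the rename itself
    finally:
        os.close(dfd)

# Write the script durably first, and only then the semaphore, so a
# crash in between causes a re-run instead of an empty script.
durable_write("/var/lib/cloud/instance/scripts/runcmd",
              b"#!/bin/sh\necho 'hello' > /var/tmp/hello.txt\n")
durable_write("/var/lib/cloud/instances/custom-hostname-961713c7a7a53c3c"
              "/sem/config_runcmd", b"created\n")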

maburlik commented 1 month ago

@TheRealFalcon I don't have permission to reopen this.

TheRealFalcon commented 1 month ago

@maburlik Thanks for the additional context.

There are quite a number of cases that would cause the system to come up in a bad state if a reboot happens while cloud-init is executing. Flushing after every file write would be costly and would increase boot times in ways we do not want. If you're expecting reboots to happen during cloud-init runs, tuning your filesystem mount options may be a better way forward.

That said, I do see how this case in particular is a bit more problematic, so I don't think it'd be unreasonable for us to add a sync in this case.