Open maburlik opened 2 months ago
Hit this again and saved the file:
/var/lib/cloud/instance/scripts/runcmd
It was in fact empty.
Per your logs (which are not complete; they don't contain the first execution of runcmd), we can see that cloud-init does not execute runcmd because it has already been executed:
2024-09-12 19:02:16,845 - modules.py[DEBUG]: Running module runcmd (<module 'cloudinit.config.cc_runcmd' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_runcmd.py'>) with frequency once-per-instance
2024-09-12 19:02:16,845 - handlers.py[DEBUG]: start: modules-config/config-runcmd: running config-runcmd with frequency once-per-instance
2024-09-12 19:02:16,845 - helpers.py[DEBUG]: config-runcmd already ran (freq=once-per-instance)
2024-09-12 19:02:16,845 - handlers.py[DEBUG]: finish: modules-config/config-runcmd: SUCCESS: config-runcmd previously ran
Hit this again and saved the file:
/var/lib/cloud/instance/scripts/runcmd
It was in fact empty.
And if this is the case, then what very likely happened is that the system rebooted while the script file was being written, before it was flushed to disk; see [1, 2].
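For illustration, the sequence that can leave an empty script behind looks roughly like this (a minimal sketch, not cloud-init's actual code; the paths are stand-ins):

```python
import os

SCRIPT = "/tmp/example-runcmd"            # stand-in for /var/lib/cloud/instance/scripts/runcmd
SEMAPHORE = "/tmp/example-config_runcmd"  # stand-in for the per-instance "already ran" record

with open(SCRIPT, "w") as f:
    f.write("#!/bin/sh\necho hello\n")    # data sits in the page cache, no fsync
    # <-- an abrupt reboot around here can leave SCRIPT existing but zero bytes long

with open(SEMAPHORE, "w") as f:
    f.write("ran\n")                      # the module is now recorded as completed

# On the next boot the "already ran" record exists, so the module is skipped
# (freq=once-per-instance) even though the script it was supposed to produce is empty.
```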
What I'm finding appears to be a race condition when cloud-init is abruptly interrupted by an out of band user-driven reboot request.
Abruptly rebooting the system while cloud-init is executing is not a supported scenario and this is not considered a bug.
Could you detect such cases and re-run cloud-init (with cloud-init clean and a reboot)? Or run runcmd on its own with https://docs.cloud-init.io/en/latest/reference/cli.html#single?
That being said, what's your expectation from this issue? How do you think cloud-init should behave in such a case?
Given that the reproduction steps include an unsupported step ("Reboot the machine before cloud-init completes the runcmd module"), I'm going to close this issue. Feel free to re-open the discussion if you think we've missed something.
- [1] cc_runcmd.py#L48
- [2] util.py#L2301
Thank you @aciba90 for your response. I'd like to get a bit of clarification on a few points.
Abruptly rebooting the system while cloud-init is executing is not a supported scenario and this is not considered a bug.
A user-driven reboot may happen during the execution of cloud-init, since cloud-init runs in the background (as administered by the platform) while the user runs their private operations. These are distinct managing actors and we cannot prevent the user from running 'shutdown -r now'.
For things like runcmd, commands may run arbitrarily long if they download large packages or perform CPU-intensive operations, so we cannot gate the user's operations behind a "cloud-init has completed" hook.
That being said, what's your expectation from this issue? How do you think cloud-init should behave in such a case?
I would expect cloud-init's tracking of whether a module has finished running to be deterministically atomic. One way to achieve that would be to persist (flush) the script file and only then write a marker file; on re-entry, cloud-init would check the marker file rather than the script file, which may or may not have been flushed. This is just one approach; maybe you have other ideas for the implementation details.
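To make the suggestion concrete, here is a minimal sketch of that idea, assuming hypothetical paths and helper names (this is not cloud-init's actual API):

```python
import os

def write_durably(path: str, content: str) -> None:
    """Write content so that it has reached the disk before we return."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(content)
        f.flush()
        os.fsync(f.fileno())              # push the data out of the page cache
    os.rename(tmp, path)                  # atomic replace: never a partially written file
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)                  # persist the rename itself
    finally:
        os.close(dir_fd)

def run_once(script_path: str, marker_path: str, content: str) -> None:
    # The marker is written only after the script is durable, so its presence
    # guarantees the script content is on disk and complete.
    if os.path.exists(marker_path):
        return
    write_durably(script_path, content)
    write_durably(marker_path, "done\n")

# Hypothetical usage (made-up marker path):
# run_once("/var/lib/cloud/instance/scripts/runcmd",
#          "/var/lib/cloud/instance/sem/config_runcmd.done",
#          "#!/bin/sh\necho hello\n")
```

The cost would be one fsync per module that generates a script, rather than one per file write.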
@TheRealFalcon I don't have permission to reopen this.
@maburlik Thanks for the additional context.
There are quite a number of cases that would cause the system to come up in a bad state if a reboot happens while cloud-init is executing. Flushing after every file write would be costly and would increase our boot times in ways we do not want. If you're expecting reboots to happen during cloud-init runs, I wonder whether tuning your filesystem mount options is a better way forward.
That said, I do see how this case in particular is a bit more problematic, so I don't think it'd be unreasonable for us to add a sync in this case.
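For what it's worth, the targeted version of that change might look something like this sketch (a hypothetical helper, not an actual cloud-init patch), so only the generated runcmd script pays the fsync cost:

```python
import os

def write_and_sync(path: str, content: str, mode: int = 0o700) -> None:
    """Write the script and fsync just this one file before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, mode)
    try:
        os.write(fd, content.encode())
        os.fsync(fd)                      # durability for this file only
    finally:
        os.close(fd)
```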
Bug report
I have an infrastructure where I'm deploying VMs and running cloud-init with a NoCloud data source package. Reboots frequently happen during the early stages after the machine comes up, and the cloud-init configuration that the user supplies to configure the VM is out of my control; it frequently includes entries under 'runcmd'.
What I'm finding appears to be a race condition when cloud-init is abruptly interrupted by an out of band user-driven reboot request.
The reboot often comes in before the runcmd module has completed running. Here is a happy-case summary as emitted in cloud-init.log, where the runcmd module runs after the reboot (machine: vm_ssh_test_2_ubuntu):
Now here is the reproduced failure. There is no mention of the runcmd module running prior to the reboot, yet there is a log line saying config-runcmd already ran, even though this is a fresh Ubuntu VM image (machine: vm_ssh_test_1_ubuntu):
My speculation is that no marker file is written to indicate whether /var/lib/cloud/instance/scripts/runcmd has been flushed to disk, which would explain why the file is missing its #! shebang.
Here is an example of this intermediate file written by cloud-init when all goes well:
As a side note, /opt/maxb-vm/first-boot-command.sh is embedded in the image prior to boot, deterministically always has the shebang, and the image is flushed to ensure the write persists.
Steps to reproduce the problem
This is a race condition and happens rarely, but becomes a blocking issue when it occurs.
Environment details
cloud-init logs
cloud-init_redacted.log