canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

cloud-init regenerating ssh-keys #3753

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1885527

Launchpad details
affected_projects = ['cloud-init (Ubuntu)']
assignee = lp-markusschade
assignee_name = Markus Schade
date_closed = 2020-11-24T17:58:47.047974+00:00
date_created = 2020-06-29T08:16:26.428564+00:00
date_fix_committed = 2020-10-29T15:06:45.344303+00:00
date_fix_released = 2020-11-24T17:58:47.047974+00:00
id = 1885527
importance = medium
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1885527
milestone = None
owner = smoser
owner_name = Scott Moser
private = False
status = fix_released
submitter = hadmut
submitter_name = Hadmut Danisch
tags = []
duplicates = []

Launchpad user Hadmut Danisch(hadmut) wrote on 2020-06-29T08:16:26.428564+00:00

Hi,

I ran some experiments with Ubuntu 20.04 virtual machines at a German cloud provider (Hetzner), which uses cloud-init to initialize machines with a basic setup such as IP and SSH access.

During my installation tests I had to reboot the virtual machines several times after installing or removing packages.

Occasionally (not always) I noticed that the SSH host keys had changed and ssh complained. After accepting the new host keys (insecure!) I found that all key files in /etc/ssh had fresh modification times, i.e. they had been freshly regenerated.

This reminds me of a bug I reported about cloud-init some time ago, where I could not change the host name permanently because cloud-init reset it to its initial configuration at every boot (highly dangerous, because it seemed to reset passwords to their original state as well).

Although cloud-init is intended to do an initial configuration for the first boot only, it seems to remain on the system and – even worse: occasionally – change configurations.

I've never understood the purpose of cloud-init remaining active once the machine is up and running.

ubuntu-server-builder commented 1 year ago

Launchpad user Hadmut Danisch(hadmut) wrote on 2020-06-29T08:36:30.016752+00:00

BTW,

the docs at https://cloudinit.readthedocs.io completely fail to explain what cloud-init actually is or is supposed to do.

They do not explain that cloud-init survives the first boot and remains active for future boots, nor why it does so or what this is good for.

There is no warning, no hint, no information that cloud-init keeps continuously twiddling with the system.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2020-06-30T11:58:02.013708+00:00

Hi, please attach the output of 'cloud-init collect-logs' when run on a system that demonstrates the problem.

cloud-init uses the "instance-id" from the metadata service to indicate a new instance. Some things run once per instance, some things run once per boot.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2020-06-30T11:58:45.798918+00:00

after replying with collected logs, please set the status back to 'new'. thanks for taking the time to file a bug.

ubuntu-server-builder commented 1 year ago

Launchpad user Valery Tschopp(valery-tschopp) wrote on 2020-07-01T09:35:14.221996+00:00

We have a similar issue:

  1. At first boot cloud-init generated the host ssh keys
  2. The metadata service crashed and went down :(
  3. At reboot, cloud-init can NOT reach the metadata service and regenerates the host ssh keys

ubuntu-server-builder commented 1 year ago

Launchpad user Hadmut Danisch(hadmut) wrote on 2020-07-01T09:54:41.449369+00:00

I currently cannot provide logs, since these were only temporary test machines in a cloud that existed for only tens of minutes to test installation procedures. I will supply logs as soon as I proceed with testing and the problem occurs again.

However, I do not understand and did not find any documentation about why cloud-init even remains active after first boot.

Descriptions like https://help.ubuntu.com/community/CloudInit or https://cloudinit.readthedocs.io/en/latest/ are simply misleading, as they suggest that this is only about the initialization of the machine. They don't say that cloud-init remains active and keeps manipulating the system.

I found this to be a severe security issue (which I reported in an earlier bug report for 18.04) when I could not permanently change the hostname of a machine, since cloud-init was resetting it with every reboot, and the file where this was stored was hidden deep somewhere in /var. I'm afraid I cannot even change a password, since cloud-init might reset it to its initial state.

I consider it a serious flaw and security problem in itself that cloud-init behaves very differently from what is described in the documentation.

AGAIN: Why is cloud-init still manipulating the machine after initialization and first boot?

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2020-07-01T12:19:26.953246+00:00

"AGAIN: Why is cloud-init still manipulating the machine after initialization and first boot?"

Because cloud-init thinks it is a "first boot". A supported use case for cloud-init is:

cloud-init will recognize that these instances are new instances, and initialize them. It recognizes this by comparing the cached value of 'instance-id' versus the current value of 'instance-id'. If they have changed, then you have a new instance.

The other reason for cloud-init to "remain active" is that it offers "per-boot" things.
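The instance-id comparison described above can be sketched roughly as follows. This is only an illustration of the idea, not cloud-init's actual code: the function names and the cache file location are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_cached_instance_id(cache_file: Path) -> Optional[str]:
    """Return the instance-id stored on a previous boot, or None if absent."""
    try:
        # An empty file is treated the same as a missing one.
        return cache_file.read_text().strip() or None
    except FileNotFoundError:
        return None


def is_new_instance(cached_id: Optional[str], current_id: str) -> bool:
    """Treat the boot as a first boot when no ID is cached or the ID changed."""
    return cached_id is None or cached_id != current_id
```

Under this model, per-instance work (such as generating SSH host keys) would run only when `is_new_instance` returns True, while per-boot work would run on every boot regardless.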

ubuntu-server-builder commented 1 year ago

Launchpad user Valery Tschopp(valery-tschopp) wrote on 2020-07-02T10:14:29.881485+00:00

We have no problem about cloud-init still being active on the machine after the first boot init.

My issue is:

On first instance boot, the metadata service is successfully contacted, and the initialisation succeeds (host ssh key generated, hostname set, ...)

But on any reboot, if the metadata service is DOWN or not reachable for any reason, then cloud-init regenerates the host ssh keys.

My understanding is that determining if it is a first boot, or not, is only based on the cached instance-id, compared to the data received from the metadata service. So if the metadata service is DOWN or not reachable, cloud-init will always think it is a first boot, right?

Isn't it possible to make this test more robust?

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2020-07-02T14:42:00.602943+00:00

@Valery,

Some cloud platforms provide the instance id via some non-network channel (dmi data is common). In those cases, cloud-init will check cached value versus the locally-available instance-id before looking for a network available datasource.

So, if Hetzner provides that information in some way, cloud-init can use it.

If not, the only options are for the user to disable cloud-init (touch /etc/cloud/cloud-init.disabled) or set manual_cache_clean (https://bugs.launchpad.net/cloud-init/+bug/1712680/comments/11).

I'm not really sold on "what if the metadata service is DOWN" argument. Your cloud should not have its important services just fail. If it does, things are going to break. You could make a similar argument "What if DNS server is down?". I'm not discounting "Design for failure", and cloud-init could definitely do better here, but we need some support from the platform (locally available instance-id) to do better without sacrificing design goals.

ubuntu-server-builder commented 1 year ago

Launchpad user Dan Watkins(oddbloke) wrote on 2020-07-07T13:31:17.847471+00:00

If Hetzner has (or starts to provide) a way of determining instance ID without using the network, we'd be more than happy to accept patches to use that in cloud-init. However, as it sounds like the issue here is Hetzner's internal services being unreliable, rather than a cloud-init issue, I'm going to mark this Incomplete. If you think this is unreasonable, please comment and change the status back to New.

Thanks!

ubuntu-server-builder commented 1 year ago

Launchpad user Hadmut Danisch(hadmut) wrote on 2020-07-07T14:00:21.206452+00:00

Yes, I do think this is unreasonable.

It is definitely not Hetzner's task to fix Ubuntu.

Especially since that process of re-initialization of that instance ID is neither obvious nor documented.

Looking at

https://cloudinit.readthedocs.io/

I did not yet find an explanation of what is going on, and at

https://cloudinit.readthedocs.io/en/latest/topics/instancedata.html

it just says

v1.instance_id

Unique instance_id allocated by the cloud.

Examples output:

i-<hash>

but does not give a hint that this is to be constantly provided by some internal service.

On the contrary, it says

"Cloud-init is the industry standard multi-distribution method for cross-platform cloud instance initialization. It is supported across all major public cloud providers, provisioning systems for private cloud infrastructure, and bare-metal installations."

It says "instance initialization". It does not say that it keeps modifying the living instance.

So this is undocumented behaviour, and I am increasingly wondering whether this is a backdoor.

ubuntu-server-builder commented 1 year ago

Launchpad user Dan Watkins(oddbloke) wrote on 2020-07-13T18:48:19.586575+00:00

"It is definitely not Hetzner's task to fix Ubuntu."

To be clear, cloud-init is not used only on Ubuntu; I believe that Hetzner's outage would have this effect across the majority of Linux distributions.

And, that aside, I don't think this characterisation is fair: we're suggesting that if Hetzner are going to allow their internal services to go down, then they should provide a more reliable way for instances to determine their identity. (This is generally done via DMI in other clouds that do it. The hypervisor stores the instance ID and provides it as a DMI value, and obviously instances can only boot if the hypervisor is up; therefore, the instance ID is always available.) To state this more glibly (and therefore less helpfully): it is not cloud-init's task to fix Hetzner.

That said, perhaps there is something that the Hetzner data source could do to handle this Hetzner-specific case. We already perform 60 retries with a 2 second wait between them, and a 2 second timeout. So we allow at least 2 minutes for the services to respond with something before we give up; we could bump that but I don't think it addresses the underlying issue. Any thoughts would be appreciated.
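To make the timing in the paragraph above concrete, here is a minimal sketch of such a retry loop together with the arithmetic behind the "at least 2 minutes" figure. The function names are hypothetical and this is not the Hetzner datasource's code; only the numbers (60 retries, 2 second wait, 2 second timeout) come from the comment.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_retries(
    fetch: Callable[[float], str],
    retries: int = 60,
    wait_s: float = 2.0,
    timeout_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(timeout_s) up to `retries` times, sleeping between failures."""
    for _ in range(retries):
        try:
            return fetch(timeout_s)
        except OSError:  # TimeoutError and ConnectionError are subclasses
            sleep(wait_s)
    return None  # gave up: the metadata service never answered


def retry_window_seconds(retries: int, wait_s: float, timeout_s: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds on the time spent before giving up."""
    lower = retries * wait_s                # every attempt fails immediately
    upper = retries * (wait_s + timeout_s)  # every attempt uses its full timeout
    return lower, upper
```

With the values from the thread, `retry_window_seconds(60, 2, 2)` gives a lower bound of 120 seconds and an upper bound of 240 seconds.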

Alternatively (or perhaps additionally), this may need a change in the instance ID model that cloud-init uses to handle an explicit "we are not currently able to determine instance ID, so assume it hasn't changed". I think, however, that this would lead to a converse problem: instances launched from instance-capture images which boot for the first time during an "instance ID outage" would not detect that they were new instances, and so would not perform their first boot customisation. This would result in potentially-inaccessible instances (if any credentials remaining in the image are not available to the user launching instances) with SSH host keys not rotated (meaning that they would all have the same host keys as the image; a security issue). Of course, if users are also relying on their cloud-init user-data to perform any actions, that also won't occur; depending on their threat model, some users might also consider this a security issue.
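The trade-off in the paragraph above can be illustrated with a small sketch. It models an "instance ID unknown" state explicitly; the names and the `assume_unchanged_when_unknown` switch are hypothetical and do not correspond to any existing cloud-init option.

```python
from enum import Enum
from typing import Optional


class BootKind(Enum):
    NEW_INSTANCE = "new-instance"
    SAME_INSTANCE = "same-instance"


def classify_boot(
    cached_id: Optional[str],
    current_id: Optional[str],
    assume_unchanged_when_unknown: bool,
) -> BootKind:
    """Decide how to treat a boot; current_id is None during an ID outage."""
    if current_id is None:
        # The cloud could not tell us who we are.
        if assume_unchanged_when_unknown and cached_id is not None:
            return BootKind.SAME_INSTANCE
        return BootKind.NEW_INSTANCE
    return BootKind.SAME_INSTANCE if cached_id == current_id else BootKind.NEW_INSTANCE
```

With the switch off, an outage on reboot yields `NEW_INSTANCE` and the host keys are regenerated, which is the behaviour reported in this bug. With the switch on, an instance launched from a captured image (which still carries the source instance's cached ID) that first boots during an outage yields `SAME_INSTANCE`, so it would keep the image's host keys. Neither answer is correct in both situations.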

The ultimate problem is that cloud-init cannot determine when it runs within an instance whether or not this is a "first boot": the cloud needs to indicate to us one way or the other, which is done via instance ID. If the cloud cannot do that, then there is no way to determine the correct behaviour.

If you are certain that you will never be capturing instances as images (i.e. you can categorically say that the root filesystem in this instance will never first boot again) and you aren't using any of cloud-init's functionality after first boot (e.g. per-boot scripts), then you can disable cloud-init in the ways described by Scott earlier in this bug.

One convenience we could potentially provide: if cloud-init had a way for image creators to express "when next launched, cloud-init should treat that instance ID as immutable and permanent" (in a way that could be undone on subsequent boots, if a user wants to "unfreeze" an instance for image capture) then we might be able to avoid some of this pain, but that idea would need more fleshing out before it's clear if it even makes sense.

"Especially since that process of re-initialization of that instance ID is neither obvious nor documented."

Agreed on both counts; cloud-init's documentation is lacking in many respects, including this one.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2020-07-24T14:10:03.713630+00:00

@Daniel,

"One convenience we could potentially provide: if cloud-init had a way for image creators to express 'when next launched, cloud-init should treat that instance ID as immutable and permanent' (in a way that could be undone on subsequent boots, if a user wants to 'unfreeze' an instance for image capture) then we might be able to avoid some of this pain, but that idea would need more fleshing out before it's clear if it even makes sense."

I think you're basically describing "manual_cache_clean".

The intent (testing is needed) for manual_cache_clean is that

a.) user-data and system config (/etc/cloud/*.cfg) can set manual_cache_clean to true or false. As always user-data overrides system config. vendor-data should also be able to provide the setting.

b.) cloud-init renders /var/lib/cloud/instance/manual-clean (path_helper.get_ipath_cur("manual_clean_marker")) if

c.) on boot, both ds-identify and cloud-init will check and respect existence of /var/lib/cloud/instance/manual-clean

So... "unfreeze", if manual_cache_clean was set is just: rm -Rf /var/lib/cloud/instance /var/lib/cloud/instance/

I think it would be good to both test that my intent/understanding are correct, and document it.
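As a rough illustration of the intent described in (c), a boot-time check that respects the marker might look like the sketch below. The marker file name comes from the comment above; the function names are hypothetical, and, as noted, the real behaviour still needs to be tested rather than assumed.

```python
from pathlib import Path

MARKER_NAME = "manual-clean"  # file name taken from the comment above


def is_frozen(instance_dir: Path) -> bool:
    """True if the manual-clean marker exists for the current instance."""
    return (instance_dir / MARKER_NAME).exists()


def should_reinitialize(instance_dir: Path, id_changed_or_unknown: bool) -> bool:
    """Skip per-instance setup while the marker is present."""
    if is_frozen(instance_dir):
        return False
    return id_changed_or_unknown
```

On a real system `instance_dir` would be /var/lib/cloud/instance; passing it as a parameter keeps the sketch easy to try against a temporary directory.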

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2020-07-24T14:19:21.171832+00:00

I opened bug 1888858 to request better documentation on this feature.

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2020-09-23T04:17:23.124496+00:00

[Expired for cloud-init (Ubuntu) because there has been no activity for 60 days.]

ubuntu-server-builder commented 1 year ago

Launchpad user Hadmut Danisch(hadmut) wrote on 2020-09-23T12:48:31.812480+00:00

I've found the problem.

The cloud provider (in this case the German provider Hetzner) provides a machine with a virtual ethernet interface

lspci | fgrep Ethernet

00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device

which is, following the naming standards, named ens3

But then the provider supplies a cloud-init file, which cloud-init fetches over HTTP (I have not yet fully understood the after-boot behaviour of cloud-init), and which gives the network configuration as

network-config: config:

thus renaming the device from ens3 to eth0

Therefore kern.log contains entries like

Sep 23 14:13:44 worker-00 kernel: [ 1.372316] virtio_net virtio0 ens3: renamed from eth0
Sep 23 14:13:44 worker-00 kernel: [ 4.834687] virtio_net virtio0 eth0: renamed from ens3

proving that the network interface changes between ens3 and eth0.

This then sometimes results in the following in cloud-init.log (logs taken from another machine where the problem occurred):

2020-09-22 15:00:15,611 - __init__.py[DEBUG]: Attempting setup of ephemeral network on ens3 with 169.254.0.1/16 brd 169.254.255.255
2020-09-22 15:00:15,611 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '169.254.0.1/16', 'broadcast', '169.254.255.255', 'dev', 'ens3'] with allowed return codes [0] (shell=False, capture=True)
2020-09-22 15:00:15,637 - handlers.py[DEBUG]: finish: init-local/search-Hetzner: FAIL: no local data found from DataSourceHetzner
2020-09-22 15:00:15,637 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceHetzner.DataSourceHetzner'> failed
2020-09-22 15:00:15,639 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceHetzner.DataSourceHetzner'> failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 770, in find_source
    if s.update_metadata([EventType.BOOT_NEW_INSTANCE]):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 659, in update_metadata
    result = self.get_data()
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceHetzner.py", line 53, in get_data
    with cloudnet.EphemeralIPv4Network(nic, "169.254.0.1", 16,
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 990, in __enter__
    self._bringup_device()
  File "/usr/lib/python3/dist-packages/cloudinit/net/__init__.py", line 1027, in _bringup_device
    subp.subp(
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 291, in subp
    raise ProcessExecutionError(stdout=out, stderr=err,
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['ip', '-family', 'inet', 'addr', 'add', '169.254.0.1/16', 'broadcast', '169.254.255.255', 'dev', 'ens3']
Exit code: 1
Reason: -
Stdout:
Stderr: Cannot find device "ens3"
2020-09-22 15:00:15,646 - main.py[DEBUG]: No local datasource found

So it seems to be a timing problem.

If the cloud-init configuration contains a different interface name than the natural standard name (here: eth0 instead of ens3), then it can sometimes (rarely, but I have experienced it several times) happen that cloud-init does not find the interface ens3 because it has been renamed to eth0. It then cannot initialize the interface, cannot fetch the init file, and for some strange reason believes it should refresh itself.
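One way to confirm the rename sequence on an affected machine is to extract the "renamed from" entries from kern.log. The following is a small sketch for doing that; the regular expression is an assumption based only on the two kernel messages quoted above, and the helper names are made up for illustration.

```python
import re
from typing import List, Tuple

# Matches e.g. "virtio_net virtio0 ens3: renamed from eth0"
RENAME_RE = re.compile(r"(?P<new>\S+): renamed from (?P<old>\S+)")


def find_renames(log_text: str) -> List[Tuple[str, str]]:
    """Return (old_name, new_name) pairs in the order they were logged."""
    return [(m.group("old"), m.group("new")) for m in RENAME_RE.finditer(log_text)]


def final_name(initial: str, renames: List[Tuple[str, str]]) -> str:
    """Follow the rename chain starting from the kernel's initial name."""
    name = initial
    for old, new in renames:
        if old == name:
            name = new
    return name
```

Applied to the two kern.log lines above, the chain is eth0 to ens3 and then ens3 back to eth0, so any lookup of ens3 after the second rename will fail with "Cannot find device".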

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2020-09-23T16:25:07.747261+00:00

Status changed to 'Confirmed' because the bug affects multiple users.

ubuntu-server-builder commented 1 year ago

Launchpad user Markus Schade(lp-markusschade) wrote on 2020-10-26T09:42:24.801727+00:00

Hetzner also provides the instance ID via DMI information (system-serial-number), so this could be used as a fallback should the metadata service not respond.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2020-10-26T13:32:52.051516+00:00

@Markus, Can you please provide a link to documentation showing that?

ubuntu-server-builder commented 1 year ago

Launchpad user Markus Schade(lp-markusschade) wrote on 2020-10-26T14:52:45.236447+00:00

I co-wrote the datasource. ;-)

I will prepare a MR to add the following to our DS:

def check_instance_id(self, sys_cfg):
    return sources.instance_id_matches_system_uuid(
        self.get_instance_id(), 'system-serial-number')

That should take care of those cases and prevent an already configured instance from being reconfigured in case the metadata server does not respond properly.
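For readers unfamiliar with the DMI approach, the idea behind such a check can be sketched without cloud-init at all. The snippet below is an assumption-laden illustration, not the datasource code: it reads the serial number from the Linux sysfs file /sys/class/dmi/id/product_serial (the value dmidecode reports as system-serial-number) and compares it with the cached instance ID. The helper names and the case-insensitive comparison are choices made for this sketch.

```python
from pathlib import Path
from typing import Optional

DMI_DIR = Path("/sys/class/dmi/id")  # Linux sysfs location of DMI values


def read_dmi_serial(dmi_dir: Path = DMI_DIR) -> Optional[str]:
    """Return the system serial number, or None if it cannot be read."""
    try:
        return (dmi_dir / "product_serial").read_text().strip() or None
    except OSError:  # missing file, or not readable without root
        return None


def instance_id_matches_serial(cached_id: Optional[str], dmi_dir: Path = DMI_DIR) -> bool:
    """True if the cached instance ID equals the locally available serial."""
    serial = read_dmi_serial(dmi_dir)
    if not cached_id or not serial:
        return False
    return cached_id.lower() == serial.lower()
```

Because the value comes from the hypervisor rather than the network, a check like this still works while the metadata service is unreachable.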

ubuntu-server-builder commented 1 year ago

Launchpad user Markus Schade(lp-markusschade) wrote on 2020-10-27T12:54:11.033693+00:00

https://github.com/canonical/cloud-init/pull/630

ubuntu-server-builder commented 1 year ago

Launchpad user Markus Schade(lp-markusschade) wrote on 2020-11-04T09:27:06.357131+00:00

New images with the patched datasource have been rolled out on the platform.

ubuntu-server-builder commented 1 year ago

Launchpad user Chad Smith(chad.smith) wrote on 2020-11-24T17:58:49.131970+00:00

This bug is believed to be fixed in cloud-init in version 20.4. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2020-11-25T01:14:41.910896+00:00

This bug was fixed in the package cloud-init - 20.4-0ubuntu1


cloud-init (20.4-0ubuntu1) hirsute; urgency=medium