canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

UUID based `instance-id` for cloud-init (reset on changes) #9814

Closed blackboxsw closed 2 years ago

blackboxsw commented 2 years ago

Issue description

Feature request: Requirements:

Currently LXD publishes a static metadata.instance-id which matches the hostname via the dev LXD API.

Cloud-init would like the instance-id served at /1.0/meta-data over /dev/lxd/sock to change any time the underlying container or VM configuration changes. The new value tells cloud-init to reconfigure the system on the next boot.

Steps to reproduce (once feature is implemented)

  1. lxc launch ubuntu-daily:jammy test-instance-id-updates
  2. lxc exec test-instance-id-updates -- cloud-init status --wait --long
  3. lxc exec test-instance-id-updates -- curl --unix-socket /dev/lxd/sock http://x/1.0/meta-data # check current instance-id
  4. lxc config set test-instance-id-updates cloud-init.user-data
  5. lxc exec test-instance-id-updates -- curl --unix-socket /dev/lxd/sock http://x/1.0/meta-data # assert instance-id changed
  6. lxc config set test-instance-id-updates cloud-init.network-config
  7. lxc exec test-instance-id-updates -- curl --unix-socket /dev/lxd/sock http://x/1.0/meta-data # assert instance-id changed
  8. lxc config set test-instance-id-updates cloud-init.vendor-data
  9. lxc exec test-instance-id-updates -- curl --unix-socket /dev/lxd/sock http://x/1.0/meta-data # assert instance-id changed
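
The assertions in steps 5, 7 and 9 could be scripted once the responses are captured. This is only a minimal sketch: it assumes the /1.0/meta-data body is line-oriented text containing an `instance-id: <value>` line (the exact format is not shown in this issue), and `extract_instance_id` and `instance_id_changed` are hypothetical helper names.

```python
def extract_instance_id(meta_data: str) -> str:
    """Return the value of the first 'instance-id:' line in a meta-data body.

    Assumption: the body is plain 'key: value' text, one pair per line.
    """
    for line in meta_data.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "instance-id":
            return value.strip()
    raise ValueError("no instance-id line found in meta-data")


def instance_id_changed(before: str, after: str) -> bool:
    """Compare the instance-id in two bodies captured around a config change."""
    return extract_instance_id(before) != extract_instance_id(after)
```

Feeding it the curl output from step 3 and step 5 would give the "assert instance-id changed" check for the user-data case, and likewise for the later steps.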

Background

On many clouds, cloud-init treats a change to meta-data.instance-id as a dirty-cache signal from the platform: network config, user-data or meta-data has changed, so cloud-init needs to re-run initial system configuration as if this were a new deployment. Many clouds implement meta-data.instance-id as a UUID that changes when the underlying network, meta-data, vendor-data or user-data changes for a given VM.

The new LXDDatasource in cloud-init consumes updated system instance-data from the LXD dev API at /dev/lxd/sock, and it will check for instance-id changes across reboots in order to determine whether a config update is needed.
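
The across-reboot check described here amounts to comparing the instance-id reported now with the one recorded on the previous boot. A rough sketch of that idea follows; it is not cloud-init's actual implementation, and the function name `needs_reconfigure` and the caller-supplied cache file are assumptions made for illustration.

```python
from pathlib import Path


def needs_reconfigure(cache_file: Path, current_id: str) -> bool:
    """Return True when current_id differs from the id recorded on the last boot.

    A missing cache file is treated as a first boot. The file is rewritten with
    current_id so that the next boot compares against the value seen now.
    """
    previous = cache_file.read_text().strip() if cache_file.exists() else None
    cache_file.write_text(current_id + "\n")
    return previous != current_id
```

With a static instance-id equal to the hostname, this check never fires after first boot, which is why a value that changes with the configuration is being requested.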


stgraber commented 2 years ago

Okay, so having a UUID which then changes if:

That part is pretty uncontroversial and shouldn't be too hard to achieve. I'm not sure that we hit Update() when the name changes so we may need to do some special casing on that one to properly reset on copy and renames.

Where it gets a bit more hairy is with the network devices especially when profiles get mixed in. We don't want someone doing lxc profile device set default eth0 user.foo blah to trigger a new UUID for every single instance LXD-wide.

So what exactly does cloud-init care about there? We could reset it only when a network device is added or removed, but then would an existing device being changed to a different MAC and a different name inside of the instance cause problems if that doesn't reset the UUID too?

blackboxsw commented 2 years ago

Agreed on your header of 3 use-cases:

--- Additional instance-id update requests per your last comment

and different name inside of the instance going to cause problems if it doesn't reset the UUID too

Yes, this would break cloud-init's generated network match, as it is based on the NIC name exposed to the instance.

The LXD dev API 1.0/devices output[1] does not currently include the MAC address, so cloud-init will rely only on the name of each type: nic device to render a generic network config on the system, without a match on MAC. As long as we know the desired device name we should be in good shape here.

[1] LXD API devices output

$ lxc exec test-it -- curl --unix-socket /dev/lxd/sock http://x/1.0/devices
{"eth0":{"name":"eth0","nictype":"bridged","parent":"lxdbr0","type":"nic"},"root":{"path":"/","pool":"default","type":"disk"}}
$ lxc exec test-it -- cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource.  Changes
# to it will not persist across an instance reboot.  To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    version: 2
    ethernets:
        eth0:
            dhcp4: true
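
The mapping from the 1.0/devices output to the generated config above can be sketched as follows. This is an illustration of the name-only approach, not cloud-init's code: `render_network_config` is a hypothetical name, and falling back to the device key when a nic has no `name` property is an assumption.

```python
import json


def render_network_config(devices_json: str) -> dict:
    """Build a version 2 network config with DHCPv4 on every device of type 'nic'."""
    devices = json.loads(devices_json)
    ethernets = {}
    for key, cfg in devices.items():
        if cfg.get("type") != "nic":
            continue  # disks and other device types carry no network config
        # Assumption: use the device key when no "name" property is set.
        ethernets[cfg.get("name", key)] = {"dhcp4": True}
    return {"network": {"version": 2, "ethernets": ethernets}}
```

Because nothing here matches on MAC address, the result depends only on the names of the nic devices, which is why a rename inside the instance matters for the instance-id.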

At some point (cloud-init needs to file a separate issue) cloud-init would like some client-side visibility, via the LXD API, into whether an interface needs dhcp6 or dhcp4, but maybe that comes in 1.0/networks/{name}/leases?

stgraber commented 2 years ago

Ok, so we introduce a new volatile.cloud-init.machine-id which gets auto-reset if:

@blackboxsw does that cover it?

To clarify that last one, it means that adding or removing an interface would trigger it, as would changing the name property on an existing device. However, if someone removes a device and at the same time introduces another one using the same instance-visible name, we won't trigger it. That should line up with what cloud-init cares about.
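
One way to read the network-device rule is as a comparison of the set of instance-visible NIC names before and after an update. The sketch below illustrates that reading only; it is not LXD's implementation (LXD is written in Go), the function names are hypothetical, and using the device key when no `name` property is set is an assumption.

```python
def nic_names(devices: dict) -> set:
    """Collect the instance-visible names of all devices of type 'nic'."""
    # Assumption: a nic without a "name" property is known by its device key.
    return {
        cfg.get("name", key)
        for key, cfg in devices.items()
        if cfg.get("type") == "nic"
    }


def nic_change_resets_id(old_devices: dict, new_devices: dict) -> bool:
    """Return True when the set of instance-visible NIC names has changed."""
    return nic_names(old_devices) != nic_names(new_devices)
```

Under this reading, adding, removing or renaming an interface changes the set, while swapping one device for another with the same instance-visible name, or editing an unrelated key such as user.foo, leaves it unchanged.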

blackboxsw commented 2 years ago

Yes, the internal accounting of volatile.cloud-init.machine-id and the trigger cases you mention should suffice for all cloud-init needs. As long as the LXD dev API 1.0/meta-data reflects an instance-id based on the value of volatile.cloud-init.machine-id, without caching getting in the way, we should be good here.