canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

[question] How is cloud-init growpart able to resize LUKS (encrypted) volumes? #4693

Open captainfalcon23 opened 9 months ago

captainfalcon23 commented 9 months ago

Hello!

Apologies if this is not the correct place to ask, but I have been scratching my head over this scenario for some time and can't figure out what's going on (or find a better place to ask).

This is directly in relation to https://github.com/canonical/cloud-init/pull/1316 and to a lesser extent https://github.com/canonical/cloud-init/pull/1032

I recently created a fresh install of Oracle Linux 9.3, and I was surprised to find that cloud-init was able to automatically extend the volume and partition size on boot. I delved into the source code, e.g. https://github.com/TheRealFalcon/cloud-init/blob/3ce1e06ac6c8b1c0bb09f437695d9bf1024ed412/cloudinit/config/cc_growpart.py where, on line 108, there is a reference to this:

KEYDATA_PATH = Path("/cc_growpart_keydata")

I cannot work out, nor find any documentation on, where this file comes from, what creates or deletes it, or how the developer of this PR knew about its existence. A Google search for "cc_growpart_keydata" yields only a few results.

The main reasons why I want to understand this are:

  1. When cloud-init extends the partition in this way, it is not effective till the next reboot
  2. I am now genuinely curious how this all hangs together, and potentially think about reimplementing it myself outside of cloud-init.

TIA for any advice and support.

EDIT:

Upon some more digging, I found https://gitlab.com/vkuznets/encrypt-rhel-image/-/blob/master/encrypt-rhel-image.py?ref_type=heads which seems to indicate it is creating the key for cloud-init, but this script isn't on my server.

aciba90 commented 9 months ago

Hey, @captainfalcon23.

/cc_growpart_keydata is not documented, afaict. I think it is a way to pass the key to cloud-init, because there were issues extracting it from the kernel. https://github.com/canonical/cloud-init/pull/1032 has more discussion about it.

This is the PR that introduced the feature. And the code lives here: cloudinit/config/cc_growpart

For more questions, please reach out in the #cloud-init channel on Libera.

holmanb commented 9 months ago

When cloud-init extends the partition in this way, it is not effective till the next reboot

@captainfalcon23 Why do you say that?

captainfalcon23 commented 9 months ago

Thanks @aciba90, I will see if I can continue the convo there. I did review those PRs, but it's still not clear what is putting the key into /cc_growpart_keydata. cc_growpart just expects it to be there. I am curious about how it gets there.

@holmanb Well, on my test server, these are the steps I took (note that this is on-prem VMware/vCenter):

  1. Create a VM from a template (template was created from a fresh install of OL9.3)
  2. Increase the disk by some amount
  3. Boot the VM
  4. Log in and check the disk and partition size using lsblk and df -khP, and confirm that the size shown is the same as the original disk size before modification.
  5. Reboot
  6. Run the same commands and confirm the disk and partition got extended
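The check in steps 4 and 6 can also be made mechanical. Below is a minimal Python sketch that reads the JSON printed by `lsblk --json --bytes -o NAME,TYPE,SIZE` and reports any crypt mapping that is noticeably smaller than the partition backing it. The function name, the slack threshold, and the sample device names are assumptions made for illustration; they do not come from cloud-init.

```python
import json

# Assumed allowance for the LUKS header / data offset. 16 MiB is a common
# LUKS2 default, so 32 MiB leaves headroom. This is a guess, not a spec value.
HEADER_SLACK_BYTES = 32 * 1024 * 1024


def find_stale_crypt_mappings(lsblk_json, slack=HEADER_SLACK_BYTES):
    """Return (parent, mapping, unused_bytes) tuples for crypt mappings that
    are more than `slack` bytes smaller than their parent device."""
    stale = []

    def walk(node):
        for child in node.get("children", []):
            if child.get("type") == "crypt":
                # int() copes with lsblk versions that emit sizes as strings.
                gap = int(node["size"]) - int(child["size"])
                if gap > slack:
                    stale.append((node["name"], child["name"], gap))
            walk(child)

    for device in json.loads(lsblk_json)["blockdevices"]:
        walk(device)
    return stale


# Hypothetical sample: sda3 has been grown to 40 GiB, but the mapping on top
# of it still has its old size of roughly 20 GiB.
GIB = 1024 ** 3
SAMPLE = json.dumps({"blockdevices": [
    {"name": "sda", "type": "disk", "size": 41 * GIB, "children": [
        {"name": "sda3", "type": "part", "size": 40 * GIB, "children": [
            {"name": "luks-root", "type": "crypt",
             "size": 20 * GIB - 16 * 1024 * 1024},
        ]},
    ]},
]})

print(find_stale_crypt_mappings(SAMPLE))
```

An empty list after the first boot would mean the mapping already matches the partition; a non-empty list matches the symptom described in step 4.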

captainfalcon23 commented 9 months ago

So I ended up wasting more time on this: I added a lot of logging to cc_growpart.py and trawled through the logs, trying to catch whatever key it was using at /cc_growpart_keydata. After much searching, and after identifying in the logs that the Python code was never hitting the method that extends the LUKS partition, I decided to completely mask and disable cloud-init and all related services and test to see what happened.

This time, after increasing the disk and rebooting, the partition was NOT grown to use the new space (hence, at least I can confirm cloud-init was handling this). After manually extending the partition using growpart /dev/sda 3 and then rebooting again, my LUKS container had grown to fill the entire disk!

I did some further digging and found this https://unix.stackexchange.com/a/685586 :

The LUKS header doesn't include the partition size and the partition is encrypted block by block. So when you extend the encrypted partition size, it should automatically extend the size of the mapped (unencrypted) partition.

However I'm not sure if LUKS will detect the change on mounted partitions. You might need to instruct it to resize active mappings with:

cryptsetup resize <mapping name>

*** Alternatively you could just close and re-open the mappings or reboot your system. ***

I can't find anything more definitive after 10 more minutes of googling, but I think that clearly explains the behaviour.
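To make the no-reboot path concrete, here is a small Python sketch that strings together the two commands mentioned above: growpart on the partition, then cryptsetup resize on the open mapping. The helper names and the mapping name "luks-root" are hypothetical, and the sketch defaults to a dry run that only prints the commands. It does not grow any filesystem or LVM layer on top of the mapping, and depending on the LUKS version cryptsetup may prompt for a passphrase.

```python
import subprocess


def plan_online_grow(disk, partition_number, mapping_name):
    """Build the two commands from the discussion as argv lists."""
    return [
        ["growpart", disk, str(partition_number)],
        ["cryptsetup", "resize", mapping_name],
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in plan:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(argv, check=True)


# /dev/sda and partition 3 come from the thread; "luks-root" is a
# hypothetical mapping name (look up the real one with lsblk).
run_plan(plan_online_grow("/dev/sda", 3, "luks-root"))
```

Keeping the plan as data makes it easy to review the commands before running them as root with dry_run=False.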

However, this opens the can of "why is there this completely undocumented feature of cloud-init"? It seems the module just assumes someone is meant to know the importance of /cc_growpart_keydata, what to insert into it, where to put it and what effect it will have. Perhaps this issue can be updated to reflect that an update to the docs is needed?

TheRealFalcon commented 9 months ago

It seems the module just assumes someone is meant to know the importance of /cc_growpart_keydata, what to insert into it, where to put it and what effect it will have.

This is what is happening. Currently, the only supported consumers of this code are the Azure confidential compute images. The slot and key are added during provisioning for cloud-init to consume during the growpart phase. It has only been tested on Azure-specific images produced by Canonical.

While the code could theoretically be used in other contexts, it hasn't been designed or tested for more general-purpose use cases. This is why it is undocumented.

More context for what is happening is available at https://github.com/canonical/cloud-init/pull/1032#discussion_r737865392

captainfalcon23 commented 9 months ago

Thanks @TheRealFalcon. What is special about those Azure confidential images which makes them the only consumer of this service?

It would be great if this could be documented and expanded for use by other consumers. After spending so much time going through the code, I don't think it is too difficult to document and use in a different context. It would just require the user to create the key and slot, ready for cloud-init to use.
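As a rough illustration of what "create the key and slot" could involve, the Python sketch below serializes a key and a key slot number into a small JSON document. The layout used here, a base64-encoded "key" field plus an integer "slot" field, is an assumption based on a reading of cc_growpart.py; it is undocumented and should be verified against the module source before relying on it. The function names and the sample key are hypothetical.

```python
import base64
import json


def build_keydata(key, slot):
    """Serialize a LUKS key (bytes) and its key slot number as JSON.

    ASSUMPTION: the field names "key" (base64 text) and "slot" (integer)
    mirror what cc_growpart appears to read. This is not a documented format.
    """
    encoded = base64.b64encode(key).decode("ascii")
    return json.dumps({"key": encoded, "slot": slot})


def parse_keydata(text):
    """Inverse of build_keydata: return (raw_key_bytes, slot)."""
    data = json.loads(text)
    return base64.b64decode(data["key"]), int(data["slot"])


# Hypothetical throwaway key and slot number, for illustration only.
payload = build_keydata(b"temporary-resize-key", 1)
assert parse_keydata(payload) == (b"temporary-resize-key", 1)
print(payload)
```

The key would also have to be enrolled in the matching LUKS key slot on the device itself during provisioning, which is outside the scope of this sketch.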

dermotbradley commented 9 months ago

@captainfalcon23 I've been working on cc_growpart over the holidays to address some issues relating to LUKS and LVM support.

However, this opens the can of "why is there this completely undocumented feature of cloud-init"? It seems the module just assumes someone is meant to know the importance of /cc_growpart_keydata, what to insert into it, where to put it and what effect it will have. Perhaps this issue can be updated to reflect that an update to the docs is needed?

As part of the PR I'm working on, I expect to make some changes to the docs. Basically, a keyfile is not required in all circumstances where the LUKS volume is being grown, only in certain ones, but neither the current code nor the docs address these scenarios.

In short, a keyfile is not required for growing most LUKSv1 volumes, and is likely required in most (but not all) circumstances for growing LUKSv2 volumes. This underlying issue is not specific to Azure.
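For anyone wanting to see which case applies to their own volume, the following Python sketch extracts the header version from the text printed by `cryptsetup luksDump <device>` and applies the rule of thumb above. The function names and the sample dump text are hypothetical, and the result is only a hint, since the comment above notes that the rule has exceptions in both directions.

```python
import re


def parse_luks_version(luks_dump_output):
    """Return the LUKS header version found in luksDump text, or None."""
    match = re.search(r"^Version:\s*(\d+)", luks_dump_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def keyfile_probably_needed(version):
    """Rule of thumb from the discussion: LUKSv2 usually needs a key to
    resize, LUKSv1 usually does not. A hint, not a guarantee."""
    return version == 2


# Hypothetical excerpt of luksDump output for a LUKSv2 volume.
SAMPLE_DUMP = "LUKS header information\nVersion:       \t2\nEpoch:         \t3\n"

version = parse_luks_version(SAMPLE_DUMP)
print(version, keyfile_probably_needed(version))
```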

captainfalcon23 commented 9 months ago

Sounds good @dermotbradley. Keen to check out the PR once done, and it can likely close this issue :)