Current situation

Kata CCv0 implementation will mount boot image as ro mode currently. For example, when the boot image is mounted as a virtio-blk-pci block device, Kata's QEMU command contains the following readonly option:

-device virtio-blk-pci,disable-modern=false,drive=image-6ffe5ef1861a8e27,scsi=off,config-wce=off,share-rw=on,serial=image-6ffe5ef1861a8e27 
-drive id=image-6ffe5ef1861a8e27,file=/root/demo/bin_katacc/ubuntu/kata-containers.img,aio=threads,format=raw,if=none,readonly

And QEMU also passes the ro designation parameter to the kernel:

root=/dev/vda1 rootflags=data=ordered,errors=remount-ro ro rootfstype=ext4

But Kata sandbox needs to download and save some temporary files (for example: user container image) when it's running. Currently these files are stored in guest OS tmpfs type folders (for example: container images are stored in /run/kata-containers/xxxxxx (xxxxxx: container ID)).

Problem: tmpfs belongs to the private memory of the guest OS. Although it can ensure that the contents of the tmpfs type folder will not leak to the host side, the size of these folders are limited by the memory size of the Pod, so these folders shouldn't be used to store relatively large files (for example: large container image or multiple container images). And confidential memory is also expensive. As a result, it's necessary to provide ephemeral storage to save those files, and at the same time, it is necessary to ensure that the data on the ephemeral storage device won't be exposed to the host side in plaintext.

Target

In Kata CCv1, it needs to provide ephemeral storage with confidentiality protection (and optionally integrity) for decrypted layer plaintext data, metadata, and container bundles at the Pod level, which means these contents are all encrypted when accessed by the host. At the same time, different rootfs's mounting methods are provided according to the actual deployment situation.

Proposal

According to the CCv1 threat model discussion, rootfs image types are divided into:

Protected boot image: The tenant provides the encrypted boot image and also controls the decryption key.
Unprotected boot image: A plaintext integrity-protected boot image is provided by the CSP.

From security perspective, mounting secure writable paths in both Protected and Unprotected cases works essentially the same: use an overlayfs to mount an extra layer of encrypted filesystem on top of the paths (such as /run/xxx) that need rw permission. And other components of rootfs will be still leaved as "read-only".

Approximate solution should be: Pod Create:

The kata runtime passes two block device options to QEMU:
- Protected boot image/Unprotected boot image corresponding block device
- An empty, size configurable block device (disk.qcow2) as the extra layer of encrypted filesystem of overlayfs
Start the Guest Kernel and mount the Protected boot image/Unprotected boot image to rootfs in "read-only" mode
- For Protected boot image: it's mounted after TDVF/TDSHIM performs remote attestation, obtains the Protected boot image's decryption key from KBS.
- For Unprotected boot image: It's mounted after Integrity verification of the Unprotected boot image (as example: using dm-verity)
Encrypt the disk.qcow2 corresponding block device (for example: using dm-crypt), format and mount it as the top layer of overlayfs on paths that need writable permission.
After that, Guest OS can store all new/temporary files in the encrypted disk.qcow2 block device and the host side can only view the ciphertext of these files.

Pod Destroy:

The encrypted block device's image (disk.qcow2) should also be wiped and destroyed simultaneously when pod is destroyed.

Assumption: Both Protected boot image and Unprotected boot image will be mounted in ro mode. Note: This proposal chooses virtio-blk + encryption soultion because per-file encryption over virtiofs will expose too much potentially sensitive metadata. Note: We need two block devices because the boot device is immutable, remains identical from one boot to the next, and may be shared across pods, while the encrypted disk is private to each pod, writable, and can be erased between boots

Questions

Which format should be used by the encrypted block device? We used qcow2 as demo in this proposal which has benefits that it can take much less space when the disk is far from full. But it exists risk of running out of host disk space as the qcow2 grows which would suspend the guest.

dm-crypt Experiment

The following experiments verify that a raw block device can be added to the VM sandbox, and use dm-crypt to protect that block devie. It also verified that the dm-crypt encrypted block device cannot be mounted again by the host/guest directly.

Create block device image

Create a qcow2 format device image with following command:

qemu-img create -f qcow2 ./disk.qcow2 1G

Create sandbox

Create sandbox with disk.qcow2 file block device, following QEMU options should be addded in the kata CCv0 runtime:

  -device virtio-blk-pci,disable-modern=false,drive=image-000000,scsi=off,config-wce=off,share-rw=on,serial=image-000000
  -drive id=image-000000,file="/root/disk.qcow2",aio=threads,format=raw,if=none

After sandbox is launched successfully, a new vdb device can be found in /dev/.

Mount and format block device

Use followimg dm-crypt command to encrypt & mount the block device inside guest OS:

cryptsetup -y luksFormat /dev/vdb  # this command will notify to input password
cryptsetup luksOpen /dev/vdb encrypted_disk

mkfs.ext4 /dev/mapper/encrypted_disk
mkdir /mnt/enc
mount /dev/mapper/encrypted_disk /mnt/enc

Can create a new file inside /mnt/enc/ path:

echo "This is Part 3 of a 12-article series about the LFCE certification" > /mnt/enc/testfile.txt

Verification

The following verifications were executed in host and guest OS, and it is determined that neither of them can successfully mount the dm-crypt encrypted block device.

mount the disk.img in host OS diretly
Mount the block device (/dev/vdb) in guest OS directly.

They all failed with following message: "unknow filesystem type 'crypto_LUKS'"

This experiment can also be used as test steps to verify the feasibility of POC solution in the future.

Thanks for the writeup and experiment report, @liangzhou121. Great work.

First, comment is on the content:

Please explain why having two block devices passed to qemu is desirable. Technically, you could have a single image with device mapper structure inside, like when a laptop boots with encrypted disk, where there is only one disk. My understanding is that you do that so that the ro filesystem can be shared across multiple instances. Can you think of any other reason worth mentioning?
Please explain why you want to mount with overlayfs rather than plain mount or switch_root. If all you need is to store images somewhere in your filesystem, then a plain mount is enough. On the other hand, overlayfs would enables changes to your /bin, /sbin or /lib, which I don't see any reason to allow.
Please document that we ruled out per-file encryption over virtiofs because we think it would expose too much meta-data. So when you document things with virtio-blk, it's really not an option but a necessity.
Why did you describe the downloaded images as "temporary"? Shouldn't we preserve this across reboots? In other words, isn't it desirable to be able to reboot with the same overlay and get back the previously downloaded images?
How do you initialize the secondary filesystem? I believe from your examples that you write zeroes in it, but those would not appears as zeroes in the guest due to encryption. If you were unlucky enough, those could be decoded using the tenant key as a valid, non-empty filesystem, no? (Admittedly improbable, but if the assumption is that it will not happen for statistical reasons, this has to be documented).
Relatedly, in your experiment, you see a message "unknow filesystem type 'crypto_LUKS'", which to me means that the host sees at least some meta-data of the guest disk in plain text. So could the host pass a bogus partition table or some other disk-level metadata to send the guest down a bogus path?

Second, nitpicking comments (typos, wording, etc):

Kata CCv0 implemenation

implemenation -> implementation

Although it can ensure that the contents of the tmpfs type folder will not leak to the host side, but the size of these folders are limited by the memory size of the Pod

"Although... but" seems incorrect to me. I think "but" should be removed.

(Also, another important argument is that this confidential memory is expensive, relative to good old disks)

As a result, it's necessary to use persistent storage to save those files,

"use" -> "provide" ?

Protected boot image: The tenant provides the encrypted boot image, and the tenant controls the decryption key.

repeated "the tenant" -> "and also controls the decryption key" ?

Unprotected boot image: A plaintext boot image is provided by the CSP which is protected with integrity.

I kept reading that as the CSP being protected, what about "A plaintext integrity-protected boot image is provided by the CSP"?

From security perspective, the mechanism of mounting Protected boot image/Unprotected boot image to a secure rootfs with writable permission are basically the same, that is, use overlayfs to mount an extra layer of encrypted filesystem with writable permissions on the mounted ro Protected boot image/Unprotected boot image rootfs.

I would rephrase that as: "Mounting a secure writable rootfs in both Protected and Unprotected cases works essentially the same: we use an overlayfs to mount an extra layer of encrypted filesystem on top of the initial rootfs, that extra layer being writable unlike the underlying rootfs, which remains mounted read-only."

QEMU needs to pass two block devices at startup:

Either "the runtime passes two block device options to qemu", or "qemu needs to be given two block devices at startup" (in other words, it's not qemu that passes them)

For Unprotected boot image: It's mounted after Integrity verification of the Unprotected boot image (as exapmle: using dm-verity)

exapmle -> example

Hope this helps.

Hi @c3d, thank you for your good comments and suggestions. Followings are my answers to your questions:

Please explain why having two block devices passed to qemu is desirable. Technically, you could have a single image with device mapper structure inside, like when a laptop boots with encrypted disk, where there is only one disk. My understanding is that you do that so that the ro filesystem can be shared across multiple instances. Can you think of any other reason worth mentioning?

Yes, the main reason is the rootfs block device should be mounted as ro to let rootfs boot image keep unchanged during sandbox's execution. Becasue sandbox's status should always identical after reboot.

For "laptop boots with encrypted disk" scenario, I think the encrypted disk was mounted as rw permission. So user's changes will be stored into the encrypted disk directly. If Kata also needs to support Protected boot image + rw option, only the boot image corresponding block device is needed to pass by QEMU(as example).

Please explain why you want to mount with overlayfs rather than plain mount or switch_root. If all you need is to store images somewhere in your filesystem, then a plain mount is enough. On the other hand, overlayfs would enables changes to your /bin, /sbin or /lib, which I don't see any reason to allow.

At start, we also considered to plain mount encrypted block device to /run/kata-containers/ to store container images. But after internal discussion, we concluded it isn't a general solution. As example, if another path also need to store some big files in future, then a separately encrypted block device need to plain mount to that path. It will cause the system too complicated.

Yes, the overlayfs would enables changes to /bin, /sbin or /lib. But these changes will be stored in the decrypted block device. So it should not cause security risk. Maybe we can also set these selected folders (/bin, /sbin or /lib) to ro again after overlayfs is mounted successfully.

And with this overlayfs + encrypted block device solution, we think we can reach security storage functionality with minimal changes to other kata related components.

Please document that we ruled out per-file encryption over virtiofs because we think it would expose too much meta-data. So when you document things with virtio-blk, it's really not an option but a necessity.

OK. I will add following description: This proposal chooses virtio-blk + encryption soultion because per-file encryption over virtiofs will expose too much potentially sensitive metadata.

Why did you describe the downloaded images as "temporary"? Shouldn't we preserve this across reboots? In other words, isn't it desirable to be able to reboot with the same overlay and get back the previously downloaded images?

Yes, I think the downloaded images can be stored in encrypted block device which used as top overlay of rootfs. But from security perspective, these contents shoudn't be stored in that block device persistently. If so, we can't ensure the sandboxs' status are always both identical and as expected after reboot.

How do you initialize the secondary filesystem? I believe from your examples that you write zeroes in it, but those would not appears as zeroes in the guest due to encryption. If you were unlucky enough, those could be decoded using the tenant key as a valid, non-empty filesystem, no? (Admittedly improbable, but if the assumption is that it will not happen for statistical reasons, this has to be documented).

The dd command (create disk.img with zeroes) is only a demo about how to create a new disk image file in Host OS. In Guest OS, this secondary filesystem's will be mounted and initialized again with following commands before overlayfs mount.

cryptsetup -y luksFormat /dev/vdb     # this step will set encrypt/decrypt password of the block device
cryptsetup luksOpen /dev/vdb encrypted_disk   # this will decrypt the block device and mout to /dev/mapper/encrypted_disk
mkfs.ext4 /dev/mapper/encrypted_disk    # format this decrypt block device, this step can ensure the device will be initialized as expected.

Relatedly, in your experiment, you see a message "unknow filesystem type 'crypto_LUKS'", which to me means that the host sees at least some meta-data of the guest disk in plain text. So could the host pass a bogus partition table or some other disk-level metadata to send the guest down a bogus path?

I think the plain text only notify this block deivce is encrypted by LUKS and it can't be mounted directly. The cryptsetup and LUKS had been upstreamed for several years. So I think this notification shouldn't expose security risk here.

Hi @c3d, thank you for your good comments and suggestions. Followings are my answers to your questions:

Please explain why having two block devices passed to qemu is desirable. Technically, you could have a single image with device mapper structure inside, like when a laptop boots with encrypted disk, where there is only one disk. My understanding is that you do that so that the ro filesystem can be shared across multiple instances. Can you think of any other reason worth mentioning?

Yes, the main reason is the rootfs block device should be mounted as ro to let rootfs boot image keep unchanged during sandbox's execution. Becasue sandbox's status should always identical after reboot.

You could perfectly have a single disk image with multiple partitions and a device mapper structure on it, some partitions being mounted read-only, some being mounted read-write. However, that would not be convenient since you would need to rebuild the image each time the rw portion changes. So I think that "shared across multiple instances" (or multiple reboots) is really key. Could you please add that to your formulation of the problem?

Please explain why you want to mount with overlayfs rather than plain mount or switch_root. If all you need is to store images somewhere in your filesystem, then a plain mount is enough. On the other hand, overlayfs would enables changes to your /bin, /sbin or /lib, which I don't see any reason to allow.

At start, we also considered to plain mount encrypted block device to /run/kata-containers/ to store container images. But after internal discussion, we concluded it isn't a general solution. As example, if another path also need to store some big files in future, then a separately encrypted block device need to plain mount to that path. It will cause the system too complicated.

I respectfully disagree. A bind mount from multiple locations in the file system to your encrypted filesystem is quite trivial. Actually, in your example, you have only demonstrated mounting an encrypted filesystem in the sandbox, but you have not used overlayfs AFAICT. To complete the setup, you would need a bind mount from /mnt/enc/run/kata-containers to /run/kata-containers and if you wanted to also be able to write some stuff under /var/, you would have another bind mount from /mnt/enc/var to /var.

Using overlayfs on top of the initial rootfs would not work, for two reasons:

overlayfs by itself does not add encryption that I know of. It's really the upperdir that would be mounted from an encrypted filesystem, but then why not use the upperdir directly (or a bind mount if you need multiple mount points)?
It looks like you are suggesting either a switch_root to an overlayfs that contains the original root, or a mount of an overlayfs to /, and both seem rather risky, complicated and unwise from a security standpoint (see below).

That being said, out of curiosity, I created an overlayfs and mounted it over /, and surprisingly, it works. Specifically, here is what I did (the first mount succeeds, expectedly, the second one too, and that was somewhat surprising, all the more that I specifically did not use a -o remount or switch_root):

# mkdir /fake-root
# mount -o bind / /fake-root      
# mkdir /fake-root-2 /fake-root-3
# mkdir /fake-root-2/upper /fake-root-2/workdir
# echo "Additional file" > /fake-root-2/upper/toto
#  mount -t overlay -o lowerdir=/fake-root,upperdir=/fake-root-2/upper,workdir=/fake-root-2/workdir  none  /fake-root-3
#  mount -t overlay -o lowerdir=/fake-root,upperdir=/fake-root-2/upper,workdir=/fake-root-2/workdir  none  /        
# mount
none on /fake-root type overlay (rw,relatime,seclabel,lowerdir=/fake-root,upperdir=/fake-root-2/upper,workdir=/fake-root-2/workdir)

Yes, the overlayfs would enables changes to /bin, /sbin or /lib. But these changes will be stored in the decrypted block device. So it should not cause security risk. Maybe we can also set these selected folders (/bin, /sbin or /lib) to ro again after overlayfs is mounted successfully.

Here too, I respectfully disagree. Protecting the root filesystem from uncontrolled modifications is a key component of modern system security. From Red Hat Core OS to macOS, most systems have made it harder and harder to modify key system components. Many vulnerabilities in the past worked by replacing one component you had access too with a malicious one, and then causing some other (privileged) component to use the malicious replacement. Disk encryption would absolutely not protect you from any of these attacks, only keeping the root filesystem read only would.

And with this overlayfs + encrypted block device solution, we think we can reach security storage functionality with minimal changes to other kata related components.

Don't get me wrong, I believe we need an overlayfs for the images, I just want to make it clear that it's on top of a regular mount for the encrypted disk, not on top of the rootfs.

My understanding at the moment is that you mostly care about having an overlayfs mount for images. Any code that depends on overlayfs needs to be modified extensively anyway, since we move that overlayfs from host to guest, so it's a different Kata component that needs to take care of it (e.g. agent instead of runtime).

Please document that we ruled out per-file encryption over virtiofs because we think it would expose too much meta-data. So when you document things with virtio-blk, it's really not an option but a necessity.

OK. I will add following description: This proposal chooses virtio-blk + encryption soultion because per-file encryption over virtiofs will expose too much potentially sensitive metadata.

Why did you describe the downloaded images as "temporary"? Shouldn't we preserve this across reboots? In other words, isn't it desirable to be able to reboot with the same overlay and get back the previously downloaded images?

Yes, I think the downloaded images can be stored in encrypted block device which used as top overlay of rootfs. But from security perspective, these contents shoudn't be stored in that block device persistently. If so, we can't ensure the sandboxs' status are always both identical and as expected after reboot.

I don't understand that reasoning.

The part that we measure and attest is immutable by design
If the security of your system depends on the content of a non-root filesystem, then it's already broken
At least as far as this proposal is concerned, the encrypted disk only contains container images. It's a data cache, nothing more.
Container images need to be validated anyway, irrespective of where we got them from (network or local disk), which we do e.g. decrypting them using secrets we got from the KBS, or using a crypto signature. So no added security is provided by throwing away container images that were previously downloaded.

How do you initialize the secondary filesystem? I believe from your examples that you write zeroes in it, but those would not appears as zeroes in the guest due to encryption. If you were unlucky enough, those could be decoded using the tenant key as a valid, non-empty filesystem, no? (Admittedly improbable, but if the assumption is that it will not happen for statistical reasons, this has to be documented).

The dd command (create disk.img with zeroes) is only a demo about how to create a new disk image file in Host OS. In Guest OS, this secondary filesystem's will be mounted and initialized again with following commands before overlayfs mount.
cryptsetup -y luksFormat /dev/vdb     # this step will set encrypt/decrypt password of the block device
cryptsetup luksOpen /dev/vdb encrypted_disk   # this will decrypt the block device and mout to /dev/mapper/encrypted_disk
mkfs.ext4 /dev/mapper/encrypted_disk    # format this decrypt block device, this step can ensure the device will be initialized as expected.

Sorry, I was specifically asking about initialization in the host, although you make a very good point about the in-guest initialization of the FS above, that would also deserve to be in the issue description 😄 .

I would prefer if we were able to create sparse (qcow2-style) images for the encrypted disks. There is an obvious trade-off there, since on one hand, we don't want to use tons of disk for mostlyi unused space (with only transient use at that, since it's really only used before we launch the container), on the other hand, we don't want to risk running out of space during run, which could cause a DoS. A qcow2 image is more likely to cause the guest to be suspended if host disk space becomes insufficient. Also, I don't think that qcow2 and disk encryption are too friendly with one another, but I need to check about that.

In any case, if we could initialize the image using qemu-img rather than dd, and ideally create a sparse image, that might be better.

Relatedly, in your experiment, you see a message "unknow filesystem type 'crypto_LUKS'", which to me means that the host sees at least some meta-data of the guest disk in plain text. So could the host pass a bogus partition table or some other disk-level metadata to send the guest down a bogus path?

I think the plain text only notify this block deivce is encrypted by LUKS and it can't be mounted directly. The cryptsetup and LUKS had been upstreamed for several years. So I think this notification shouldn't expose security risk here.

Well, LUKS is definitely robust when used correctly. My concern here is that we may be using it incorrectly, given that the objective would be that not just the host, but the hypervisor itself, should not be able to see any cleartext. I have a regular meeting on Thursdays with someone who can answer this question, so I will ask and report then.

You could perfectly have a single disk image with multiple partitions and a device mapper structure on it, some partitions being mounted read-only, some being mounted read-write. However, that would not be convenient since you would need to rebuild the image each time the rw portion changes. So I think that "shared across multiple instances" (or multiple reboots) is really key. Could you please add that to your formulation of the problem?

OK, I will add following: Note: Although this proposal choose two separate block devices(rootfs image + writable filesystem), we can also use one block device with two partitions, one partition used as ro rootfs, other as writable filesystem. But this approach maybe exist "shared across multiple instances" problem because the image needs to be rebuild each time if the rw portion changes.

I respectfully disagree. A bind mount from multiple locations in the file system to your encrypted filesystem is quite trivial. Actually, in your example, you have only demonstrated mounting an encrypted filesystem in the sandbox, but you have not used overlayfs AFAICT. To complete the setup, you would need a bind mount from /mnt/enc/run/kata-containers to /run/kata-containers and if you wanted to also be able to write some stuff under /var/, you would have another bind mount from /mnt/enc/var to /var.

Currently, this proposal's experiment only demonstrate dm-crypt function. But I have tried overlayfs + encryption locally with the following commands and it seems workable:

# create
cryptsetup -y luksFormat /dev/vdb
# Open
cryptsetup luksOpen /dev/vdb encrypted_disk

mkdir /run/upper
mount /dev/mapper/encrypted_disk /run/upper
mkdir /run/upper/upper /run/upper/workdir
# mount /mnt/
mount -t overlay -o lowerdir=/mnt/,upperdir=/run/upper/upper,workdir=/run/upper/workdir overlay /mnt

And about the folder list (such as: /var, /run/kata-containers) that need to be mounted in order to complete the setup, I think it's better to create a separate issue to discuss it.

Here too, I respectfully disagree. Protecting the root filesystem from uncontrolled modifications is a key component of modern system security. From Red Hat Core OS to macOS, most systems have made it harder and harder to modify key system components. Many vulnerabilities in the past worked by replacing one component you had access too with a malicious one, and then causing some other (privileged) component to use the malicious replacement. Disk encryption would absolutely not protect you from any of these attacks, only keeping the root filesystem read only would.

Yes, leave the rootfs ro is reasonable. So we will update the proposal to designate only overlayfs mount encrypted block device to paths that need rw permission (such as /run/xxx). @jiangliu can you help to add details why mount as overlayfs is better than mount directly here?

Using overlayfs on top of the initial rootfs would not work, for two reasons:

In the latest proposal, we won't mount overlayfs on top of rootfs (/) directly. As you mentioned that it's necessary to keep rootfs "read-only" to enhance sandbox's security.

My understanding at the moment is that you mostly care about having an overlayfs mount for images. Any code that depends on overlayfs needs to be modified extensively anyway, since we move that overlayfs from host to guest, so it's a different Kata component that needs to take care of it (e.g. agent instead of runtime).

We think all sandbox cached data (which includes container images) need to be stored in encrypted block device which should can minimize the sandbox's memory usage.

I don't understand that reasoning.

The part that we measure and attest is immutable by design

If the security of your system depends on the content of a non-root filesystem, then it's already broken

At least as far as this proposal is concerned, the encrypted disk only contains container images. It's a data cache, nothing more.

Container images need to be validated anyway, irrespective of where we got them from (network or local disk), which we do e.g. decrypting them using secrets we got from the KBS, or using a crypto signature. So no added security is provided by throwing away container images that were previously downloaded.

Your points are reasonable but we think the encrypted block device's lifecycle shoud be aligned with pod's lifecycle. So the encrypted disk should be wiped and destroyed simultaneous if the pod is destroyed.

I would prefer if we were able to create sparse (qcow2-style) images for the encrypted disks. There is an obvious trade-off there, since on one hand, we don't want to use tons of disk for mostlyi unused space (with only transient use at that, since it's really only used before we launch the container), on the other hand, we don't want to risk running out of space during run, which could cause a DoS. A qcow2 image is more likely to cause the guest to be suspended if host disk space becomes insufficient. Also, I don't think that qcow2 and disk encryption are too friendly with one another, but I need to check about that.

We tried to test qcow2 image + dm-crypt with following commands and it seems workable:

# generate 1G disk image
qemu-img create -f qcow2 ./disk.qcow2 1G

# Following commands are executed in Guest OS
# create
cryptsetup -y luksFormat /dev/vdb
# Open
cryptsetup luksOpen /dev/vdb encrypted_disk
# Format
mkfs.ext4 /dev/mapper/encrypted_disk
# mount
mkdir /mnt/enc
mount /dev/mapper/encrypted_disk /mnt/enc

echo "This is Part 3 of a 12-article series about qcow2." > /mnt/enc/testfile.txt

I will update the proposal's experiment section to use qcow2 image as extra encrypted block device.

You could perfectly have a single disk image with multiple partitions and a device mapper structure on it, some partitions being mounted read-only, some being mounted read-write. However, that would not be convenient since you would need to rebuild the image each time the rw portion changes. So I think that "shared across multiple instances" (or multiple reboots) is really key. Could you please add that to your formulation of the problem?

OK, I will add following: Note: Although this proposal choose two separate block devices(rootfs image + writable filesystem), we can also use one block device with two partitions, one partition used as ro rootfs, other as writable filesystem. But this approach maybe exist "shared across multiple instances" problem because the image needs to be rebuild each time if the rw portion changes.

Oh no, please don't do that. I was just using that as an example to illustrate my question. I think that the text you need to add is not that we can use a single block device, but why we cannot 😄 Sorry if my explanation was confusing.

Oh no, please don't do that. I was just using that as an example to illustrate my question. I think that the text you need to add is not that we can use a single block device, but why we cannot 😄 Sorry if my explanation was confusing.

OK, I will remove it from the proposal.

Oh no, please don't do that. I was just using that as an example to illustrate my question. I think that the text you need to add is not that we can use a single block device, but why we cannot 😄 Sorry if my explanation was confusing.

OK, I will remove it from the proposal.

I don't see it currently in the text of the issue. Either you were very fast, or the "proposal" you are talking about is elsewhere?

We tried to test qcow2 image + dm-crypt with following commands and it seems workable:

I don't know for sure if qcow2 is really better for this use case. One of the benefits of qcow2 is that sparse images can take much less space when the disk is far from full. We need to verify how much we can save in our use case, and also consider the risk of running out of host disk space as the qcow2 grows, which would suspend the guest. So I would not hardcode qcow2 in the document yet, just make sure we keep it as an open option. This is why I suggested we use qemu-img which works both for sparse and raw images, as opposed to dd.

Currently, this proposal's experiment only demonstrate dm-crypt function. But I have tried overlayfs + encryption locally with the following commands and it seems workable:
# create
cryptsetup -y luksFormat /dev/vdb
# Open
cryptsetup luksOpen /dev/vdb encrypted_disk

mkdir /run/upper
mount /dev/mapper/encrypted_disk /run/upper
mkdir /run/upper/upper /run/upper/workdir
# mount /mnt/
mount -t overlay -o lowerdir=/mnt/,upperdir=/run/upper/upper,workdir=/run/upper/workdir overlay /mnt
And about the folder list (such as: /var, /run/kata-containers) that need to be mounted in order to complete the setup, I think it's better to create a separate issue to discuss it.

Sure, we can move it elsewhere, but I think we can still discuss the basic layout here. What do you think of the following:

Create and open luks, same as yours:

cryptsetup -y luksFormat /dev/vdb
cryptsetup luksOpen /dev/vdb encrypted_disk

Mount encrypted disk to some known location:

mkdir /mnt/encrypted
mount /dev/mapper/encrypted_disk /mnt/encrypted

Mount /run (or /var/run if you prefer) from encrypted disk

mkdir -p /mnt/encrypted/run
mount -t bind /mnt/encrypted/run /run

We expect unmodified image download code to use overlayfs in /run, check that they will be able to do that (where /run/images is a mock path for the container root filesystem):

mkdir -p /run/lower /run/upper/upper /run/upper/workdir /run/images
mount -t overlay -o lowerdir=/run/lower,upperdir=/run/upper/upper,workdir=/run/upper/workdir overlay /run/images

The reason for a setup like this is that I believe the image expansion code will use overlayfs to build the image that it presents to the workload, but other pieces, such as (encrypted) image download, attestation, key download, image decryption, will all need scratch space but not benefit from overlayfs at all. So we should not pay the cost of overlayfs where we don't need it, the whole confidential computing is already expensive enough as it is 😄

I don't know for sure if qcow2 is really better for this use case. One of the benefits of qcow2 is that sparse images can take much less space when the disk is far from full. We need to verify how much we can save in our use case, and also consider the risk of running out of host disk space as the qcow2 grows, which would suspend the guest. So I would not hardcode qcow2 in the document yet, just make sure we keep it as an open option. This is why I suggested we use qemu-img which works both for sparse and raw images, as opposed to dd.

I think your concern is right, I will add a separately "Questions" section to describle this question and other potential questions in future.

Sure, we can move it elsewhere, but I think we can still discuss the basic layout here. What do you think of the following:

Create and open luks, same as yours:
cryptsetup -y luksFormat /dev/vdb
cryptsetup luksOpen /dev/vdb encrypted_disk
Mount encrypted disk to some known location:
mkdir /mnt/encrypted
mount /dev/mapper/encrypted_disk /mnt/encrypted
Mount /run (or /var/run if you prefer) from encrypted disk
mkdir -p /mnt/encrypted/run
mount -t bind /mnt/encrypted/run /run
We expect unmodified image download code to use overlayfs in /run, check that they will be able to do that (where /run/images is a mock path for the container root filesystem):
mkdir -p /run/lower /run/upper/upper /run/upper/workdir /run/images
mount -t overlay -o lowerdir=/run/lower,upperdir=/run/upper/upper,workdir=/run/upper/workdir overlay /run/images
The reason for a setup like this is that I believe the image expansion code will use overlayfs to build the image that it presents to the workload, but other pieces, such as (encrypted) image download, attestation, key download, image decryption, will all need scratch space but not benefit from overlayfs at all. So we should not pay the cost of overlayfs where we don't need it, the whole confidential computing is already expensive enough as it is 😄

I have a question about how to handle following scenario:

The target path /run/(as example) has already included some Guest OS needed files before the execution of mount -t bind /mnt/encrypted/run /run
Run mount -t bind /mnt/encrypted/run /run which will hide these files in /run/
After that Guest OS wants to access these files in /run/ and it will fail beacuse these files had beed hiden by bind mount command. This maybe will cause Guest OS execution error.

Do you think this scenario is reasonable?

I have a question about how to handle following scenario:

The target path /run/(as example) has already included some Guest OS needed files before the execution of mount -t bind /mnt/encrypted/run /run

Run mount -t bind /mnt/encrypted/run /run which will hide these files in /run/

After that Guest OS wants to access these files in /run/ and it will fail beacuse these files had beed hiden by bind mount command. This maybe will cause Guest OS execution error.

Do you think this scenario is reasonable?

Ah, now I see why you wanted to use overlays here. While I used /run in the example, that was just as an example. I would expect that this trick would be used for directories we own, i.e. stuff the guest OS would not touch. The image download directory falls into that category. Arguably, it would be somewhere much deeper, e.g. /var/lib/containers/storage. I would not expect the OS to have used that (or even created it).

However, if there are cases where we really run into the situation you describe, then yes, you could obviously use overlayfs. Do you think this would be the default scenario?

Target

In Kata CCv1, we need to provide persistent storage with confidentiality protection

Should the "persistent storage" actually be "ephemeral storage"? Because those storage's lifetime are bound to the pod and it won't really be persistent.

In Kata CCv1, we need to provide persistent storage with confidentiality protection

Should the "persistent storage" actually be "ephemeral storage"? Because those storage's lifetime are bound to the pod and it won't really be persistent.

Well, as I pointed out above, to be consistent with the experience in the non-confidential case, where downloaded images persist from one launch of the container to the next, I would like even the image storage to be persistent. I can understand if there is an "ephemeral" option (similar to the podman run --rmi option) that cleans up once you ran. For one thing, I would recommend for this cleanup to happen on pod exit, not by "reinitializing" the storage before the next launch, if only for security or disk space reasons.

Should the "persistent storage" actually be "ephemeral storage"? Because those storage's lifetime are bound to the pod and it won't really be persistent.

Yes, the "ephemeral storage" is more precisely. I will update the proposal.

Well, as I pointed out above, to be consistent with the experience in the non-confidential case, where downloaded images persist from one launch of the container to the next, I would like even the image storage to be persistent. I can understand if there is an "ephemeral" option (similar to the podman run --rmi option) that cleans up once you ran. For one thing, I would recommend for this cleanup to happen on pod exit, not by "reinitializing" the storage before the next launch, if only for security or disk space reasons.

We think the storage should be ephemeral storage (apologize that I caused some confusion with persistent). So I updated the proposal to add following: Pod Destroy: The encrypted block device's image (disk.qcow2) should also be wiped and destroyed simultaneously when pod is destroyed

@c3d @liangzhou121 sorry for late to the party:) Thanks for the discussions and I have some different thoughts about this topic, so I have written a company design doc related to this topic. Please refer to: https://github.com/jiangliu/documentation/blob/storage/Confidential-Storage-Arch.md

I have a question about how to handle following scenario:

The target path /run/(as example) has already included some Guest OS needed files before the execution of mount -t bind /mnt/encrypted/run /run

Run mount -t bind /mnt/encrypted/run /run which will hide these files in /run/

After that Guest OS wants to access these files in /run/ and it will fail beacuse these files had beed hiden by bind mount command. This maybe will cause Guest OS execution error.

Do you think this scenario is reasonable?

Ah, now I see why you wanted to use overlays here. While I used /run in the example, that was just as an example. I would expect that this trick would be used for directories we own, i.e. stuff the guest OS would not touch. The image download directory falls into that category. Arguably, it would be somewhere much deeper, e.g. /var/lib/containers/storage. I would not expect the OS to have used that (or even created it).

I think overlayfs makes sense for pretty generic cases where you would want to use this additional block device for essentially adding a rw layer on top of a ro mounted guest image. That's an interesting goal, but I think that there's something dysfunctional about a guest trying to write somewhere on a ro filesystem (outside of tmpfs/RAM). I'd prefer to see the guest failing rather than letting it follow unexpected behaviors, even more so in a confidential computing context.

What I'm trying to say here is that although the generic case is interesting, in most cases we likely don't want to allow for it (building an rw layer on top of a ro root filesystem). What we're really trying to achieve here is using an additional block device to store container images and unpack them there. And avoid using RAM for that purpose. Since we control the guest and the image management layer inside the guest, why not simply mount that disk at a well known place and tell the guest management layer to use that/those mount points to download and unpack container images? The image-rs crate will be able to use any place in the filesystem to download and then unpack. Then kata-agent will be able to use the unpacked location as container bundles as well.

If we restrict ourselves to that scope, and support ephemeral storage only for now, the flow would look like:

Pass an additional block device to the CC guest
kata-agent wipes and encrypts the block device from the guest. It's now protected from the host.
kata-agent formats and mounts the block device at a known place inside the guest (e.g. /mnt/ctr-storage)
kata-agent runs the attestation process and gets provisioned with container image keys
kata-agent uses image-rs to pull the compressed and encrypted image layers somewhere under /mnt/ctr-storage)
image-rs decrypts the image layers and unpack the container images under /mnt/ctr-storage. For example, container foo's image bundle would eventually end up at /mnt/ctr-storage/foo/bundle
kata-agent runs the container from its bundle, who happens to be stored by an encrypted block device.

Reducing our focus to using ephemeral storage only for storing container image layers and unpacked bundles makes things a little simpler, I think.

@liangzhou121 @jiangliu @c3d Does that make sense?

@liangzhou121 @jiangliu @c3d Does that make sense?

Makes sense to me, and is essentially what I was advocating for.

In steps following 7, there may still be an overlayfs involved if you expand multi-layered images, but that overlayfs resides on the mounted storage, i.e. /mnt/ctr-storage in your example.

@liangzhou121 is this issue still relevant or can be closed? If it's still relevant to what release do you think we should map it to (mid-November, end-December, mid-February etc...)?

@liangzhou121 is this issue still relevant or can be closed? If it's still relevant to what release do you think we should map it to (mid-November, end-December, mid-February etc...)?

Yes, this issue is covered by jiangliu's new proposal, so it can be closed directly now.

confidential-containers / documentation

Proposal of CC Storage Security #20

Current situation

Target

Proposal

Questions

dm-crypt Experiment

Create block device image

Create sandbox

Mount and format block device

Verification

Target