IaaS standard on user data backup

markus-hentsch commented 6 months ago

As a CSP I want to know where user data[^1] is aggregated, how it can be backed up and which standards SCS establishes with regard to those backups.

As a customer I want clear documentation and guidelines on how to back up my user data[^1] using native OpenStack mechanisms or alternatives compatible with SCS clouds.

[^1]: user data as in data uploaded by the user to the cloud and data generated by the user in the cloud at runtime (e.g. VM disk filesystems). This excludes network traffic, RAM contents and IDM data in Keystone as well as cloud resource configuration data (VM, volume, network metadata etc.).

Definition of Done:

[x] Overview of user data sources has been documented
[x] Appropriate backup mechanisms have been identified to cover all backup sources
[x] A decision record and/or standard proposal has been written for the CSP side
[ ] In case of a new standard: proposal has been voted upon and changed into a draft
[x] A user guide documentation has been written for the customer side

markus-hentsch commented 6 months ago

Current State

Backup sources:

Glance Images
Nova Server Disks (Ephemeral Storage)
Cinder Volumes
Barbican Secrets

1. Glance Images

1.1. Glance image download

Glance images can be downloaded via openstack image save or the corresponding API action and then be stored outside of the infrastructure by the user. OpenStack does not include any dedicated backup mechanisms for Glance images aside from that.

Source: Glance storage backend
Target: Download location

Conversely, Glance itself acts as a backup target for Ephemeral Storage disks of VMs in Nova and for volumes in Cinder, see sections 2 and 3.

2. Nova Server Disks (Ephemeral Storage)

2.1. Nova disk to Glance image (Nova createImage API action)

When openstack server image create is used on a VM that uses an Ephemeral Storage disk, a full image of the disk is created and stored in Glance. This acts as a full backup of the original disk data.

Note: If the VM also has Cinder volumes attached to it, they will not be included in the Glance image. See the server image create details in section 3.2 about volumes.

Source: Nova Ephemeral Storage backend
Target: Glance image storage backend

2.1.a Shelve action

For the openstack server shelve action, a full disk image is created if there is Ephemeral Storage involved. Any attached Cinder volumes are simply detached while the VM is in the SHELVED_OFFLOADED state. No volume snapshots are created. If the VM has no Ephemeral Storage, no image is created in Glance.

Source: Nova Ephemeral Storage backend
Target: Glance image storage backend

3. Cinder Volumes

3.1. Cinder Backup API

When openstack volume backup create is used, an (optionally incremental) backup of the volume data is stored in the backup backend.

Supported backup backends include Swift, NFS and GlusterFS, among others. The backup backend must be configured in Cinder.

Note: Encrypted volumes share the same Barbican key and LUKS encryption with their backups.

Source: Cinder storage backend (e.g. Ceph RBD, LVM, etc.)
Target: Cinder backup storage backend (e.g. NFS, Swift, etc.)

3.2. Nova createImage API action

When openstack server image create is used on a VM that has one or more volumes attached to it, the following happens:

An image is registered in Glance (metadata only).
If the boot disk is Ephemeral Storage, an image of the Ephemeral disk is written to the Glance image.
For every volume attached to the VM a snapshot is created in Cinder and the snapshot reference is stored in the Glance image.

This means that a VM with only volumes attached will result in an image that consists only of metadata and references to Cinder snapshots, with no actual binary image data in Glance itself!

Note: For volumes this action only creates snapshots which are not considered backups because they reside in the same storage backend as the volumes themselves and aren't full copies.

Source: Cinder storage backend
Target: Cinder storage backend (!)
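
The behavior described in 3.2 can be summarized as a small decision function. This is only a sketch of the outcome as described above, not OpenStack code; the names `ServerImageResult` and `server_image_create_outcome` are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerImageResult:
    # True if binary disk data ends up in the Glance image
    image_contains_disk_data: bool
    # Volumes for which only a Cinder snapshot is created (no backup)
    snapshotted_volumes: List[str]


def server_image_create_outcome(
    boot_disk_is_ephemeral: bool, attached_volumes: List[str]
) -> ServerImageResult:
    """Model what `openstack server image create` produces (hypothetical helper)."""
    return ServerImageResult(
        # Only an Ephemeral Storage boot disk is written into the image.
        image_contains_disk_data=boot_disk_is_ephemeral,
        # Every attached volume is merely snapshotted inside Cinder.
        snapshotted_volumes=list(attached_volumes),
    )
```

For a VM that boots from a volume, `server_image_create_outcome(False, ["vol-a"])` reports no disk data in the image and one snapshotted volume, which is the metadata-only case highlighted above.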

3.3. Cinder volume to Glance image

When openstack image create --volume is used on a volume, a full image of the binary data of the volume will be created and uploaded to Glance.

Note that this does not work on volumes currently attached to VMs. To avoid having to detach them, a detour using a volume snapshot can be taken as shown below.

3.3.a Attached volume to Glance image

Create a snapshot of the volume via openstack volume snapshot create --volume.
Create a temporary volume based on the snapshot via openstack volume create --snapshot.
Create the image from the temporary volume via openstack image create --volume.
Delete the temporary volume using openstack volume delete.

Warning: Creating a Glance image from an encrypted Cinder volume will store the LUKS-encrypted data blocks in the image. This image is useless without the corresponding encryption key stored in a Barbican secret!

Source: Cinder storage backend
Target: Glance storage backend
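
The four steps of 3.3.a could be scripted roughly as follows. This is a minimal sketch that only assembles the CLI calls in order; the helper name and the temporary resource names are assumptions. A real script would also have to wait for each resource to become available before the next step, and depending on the deployment, snapshotting an in-use volume may additionally require `--force`.

```python
from typing import List


def attached_volume_to_image_commands(
    volume: str, image_name: str, tmp_prefix: str = "tmp-backup"
) -> List[List[str]]:
    """Return the CLI calls for the snapshot detour, in execution order."""
    snapshot = f"{tmp_prefix}-snapshot"   # hypothetical temporary name
    temp_volume = f"{tmp_prefix}-volume"  # hypothetical temporary name
    return [
        # 1. Snapshot the attached volume.
        ["openstack", "volume", "snapshot", "create", "--volume", volume, snapshot],
        # 2. Create a temporary volume from the snapshot.
        ["openstack", "volume", "create", "--snapshot", snapshot, temp_volume],
        # 3. Upload the temporary volume's data to Glance.
        ["openstack", "image", "create", "--volume", temp_volume, image_name],
        # 4. Remove the temporary volume again.
        ["openstack", "volume", "delete", temp_volume],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the previous resource has reached the `available` state.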

4. Barbican secrets

Barbican secrets can originate from one of two actions:

A Cinder volume* is created with an encrypted type. OpenStack automatically generates and stores the encryption key as a secret in Barbican.
A user utilizes the Barbican API directly to either let Barbican generate a secret or to upload a secret to Barbican.

* in case of encrypted volumes the key stored in Barbican is crucial for the volume data to be useful. This extends to volume backups created via the Cinder Backup API or Cinder to Glance image API action, since volume data is backed up in encrypted form!

4.1. Barbican secret download

Barbican does not offer any backup mechanisms for secrets. A secret can be retrieved in plaintext using openstack secret get -p or the corresponding API endpoint. It is then the user's responsibility to appropriately store and protect it as a backup.

Source: Barbican database
Target: Download location
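
For reference, the source/target pairs from sections 1 to 4 can be collected in one structure. The sketch below simply restates the pairs listed above; the variable and function names are illustrative. Filtering out paths whose target is the same backend as the source singles out 3.2 as the one mechanism that does not qualify as a backup.

```python
BACKUP_PATHS = [
    {"section": "1.1", "command": "openstack image save",
     "source": "Glance storage backend", "target": "Download location"},
    {"section": "2.1", "command": "openstack server image create",
     "source": "Nova Ephemeral Storage backend", "target": "Glance image storage backend"},
    {"section": "2.1.a", "command": "openstack server shelve",
     "source": "Nova Ephemeral Storage backend", "target": "Glance image storage backend"},
    {"section": "3.1", "command": "openstack volume backup create",
     "source": "Cinder storage backend", "target": "Cinder backup storage backend"},
    {"section": "3.2", "command": "openstack server image create",
     "source": "Cinder storage backend", "target": "Cinder storage backend"},
    {"section": "3.3", "command": "openstack image create --volume",
     "source": "Cinder storage backend", "target": "Glance storage backend"},
    {"section": "4.1", "command": "openstack secret get -p",
     "source": "Barbican database", "target": "Download location"},
]


def off_backend_paths(paths=BACKUP_PATHS):
    """Keep only paths whose data leaves the source storage backend."""
    return [p for p in paths if p["source"] != p["target"]]
```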

[Figure: SCS user data backup overview, page 1 (draw.io diagram, 2024-03-26)]

[Figure: SCS user data backup overview, page 2 (draw.io diagram, 2024-03-26)]

markus-hentsch commented 6 months ago

Backups of encrypted volumes become useless if the encryption key is not backed up as well. When handling encrypted volumes, there are two things to consider:

Encryption keys are stored in Barbican and referenced via ID in the volume metadata.
Encryption keys extracted from Barbican cannot be used directly as LUKS passphrases because OpenStack applies additional processing to them.

1. Retrieving encryption keys

Before Wallaby, point 1 was problematic for users since the secret reference was not visible to them via the API, so the key could not be identified in Barbican easily. However, Cinder added the visibility of encryption_key_id to the volume API in microversion 3.64^1 which is available since Wallaby.
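
A client that wants to know whether it can expect encryption_key_id in volume responses could compare the negotiated Block Storage microversion against 3.64. The following is a small sketch under that assumption; the function names are illustrative. Note that microversions must be compared numerically per component, since a plain string comparison would rank "3.9" above "3.64".

```python
from typing import Tuple

# First microversion that exposes encryption_key_id (per the text above)
ENCRYPTION_KEY_ID_MIN_MICROVERSION = (3, 64)


def parse_microversion(value: str) -> Tuple[int, int]:
    """Split a 'major.minor' microversion string into integers."""
    major, minor = value.split(".")
    return int(major), int(minor)


def exposes_encryption_key_id(microversion: str) -> bool:
    """True if the given microversion includes encryption_key_id."""
    return parse_microversion(microversion) >= ENCRYPTION_KEY_ID_MIN_MICROVERSION
```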

Open question: do Glance images created from encrypted volumes have the key ID reference added to their metadata? I believe this should be the case otherwise they couldn't be restored properly?

2. Converting Barbican secrets to work with image backups

Glance images created from encrypted Cinder volumes using openstack image create --volume will carry raw LUKS-encrypted data blocks in them, meaning the image is effectively encrypted using the original volume's encryption. This also means that the encryption is still bound to the same encryption key (LUKS passphrase) that is stored in Barbican and referenced by the encryption_key_id attribute of the source volume.

OpenStack uses a character transformation to convert potentially binary encryption keys to valid ASCII using binascii.hexlify()^3 before passing them to cryptsetup as passphrases for LUKS disk encryption.

This means that secrets downloaded from Barbican must pass the same conversion in case the image data (LUKS encrypted) is to be decrypted outside of OpenStack for backup restoration purposes.

markus-hentsch commented 6 months ago

I used an extended DevStack environment^1 to answer the open questions above:

Question 1

Open question: do Glance images created from encrypted volumes have the key ID reference added to their metadata? I believe this should be the case otherwise they couldn't be restored properly?

The secret of the volume is cloned and the clone's ID is then bound to the image and referenced as properties.cinder_encryption_key_id in the image's metadata:

openstack image show ...

+------------------+-----------------------------------------------------------------------------------------+
| Field            | Value                                                                                   |
+------------------+-----------------------------------------------------------------------------------------+
| ...              | ...                                                                                     |
| owner            | d3c3a86fda9c4190960cbd1e9496ab82                                                        |
| properties       | cinder_encryption_key_deletion_policy='on_image_deletion',                              |
|                  | cinder_encryption_key_id='55f142f0-4915-47fe-80ed-69b34aa77e7f', hw_rng_model='virtio', |
|                  | ...                                                                                     | 
| ...              | ...                                                                                     | 
+------------------+-----------------------------------------------------------------------------------------+

Question 2

This means that secrets downloaded from Barbican must pass the same conversion in case the image data (LUKS encrypted) is to be decrypted outside of OpenStack for backup restoration purposes.

I was able to successfully mimic what OpenStack does with LUKS and the Barbican secret outside of Glance/Cinder using the following procedure:

openstack image save --file image.raw $IMAGE_NAME_OR_ID

openstack image show -f value -c properties $IMAGE_NAME_OR_ID
# (use the value of `cinder_encryption_key_id` as `$SECRET_ID` below)
openstack secret get --file image.key --payload_content_type "application/octet-stream" $SECRET_ID

python3 -c "import binascii; \
    f = open('image.key', 'rb'); \
    print(binascii.hexlify(f.read()).decode('utf-8'))" \
    | sudo cryptsetup luksOpen ./image.raw decrypted_image

The image's contents are now loaded as /dev/mapper/decrypted_image and can be mounted or snapshotted.
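
The key conversion step of the procedure above can also be isolated as a reusable function. This is a minimal sketch that mirrors the `binascii.hexlify()` one-liner; the function names are chosen for illustration only.

```python
import binascii
from pathlib import Path


def luks_passphrase_from_key(key: bytes) -> str:
    """Hex-encode raw key bytes into the ASCII passphrase used for LUKS."""
    return binascii.hexlify(key).decode("ascii")


def luks_passphrase_from_file(path: str) -> str:
    """Read a key file downloaded from Barbican and convert it."""
    return luks_passphrase_from_key(Path(path).read_bytes())
```

For example, the three key bytes `b"\x00\xffA"` become the passphrase `00ff41`. The resulting string is what gets piped into `cryptsetup luksOpen` in the procedure above.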

markus-hentsch commented 6 months ago

I've integrated the findings of the previous comment about the secret handling into the docs PR.

berendt commented 6 months ago

user data as in data uploaded by the user to the cloud

Should we also add the user data provided via the meta data service to a running instance?

markus-hentsch commented 6 months ago

user data as in data uploaded by the user to the cloud
Should we also add the user data provided via the meta data service to a running instance?

Are you referring to the script/configuration data that can be passed as user_data in Nova's server POST /servers API request (https://docs.openstack.org/api-ref/compute/#id11)?

Your statement makes it sound like something that can be added at runtime but I'm not aware of anything after the creation of the server. It doesn't seem like PUT /servers allows user_data to be modified judging from the API docs.

In any case good point, I'll have a look.

markus-hentsch commented 6 months ago

FTR, I also inspected the behavior of Cinder Backup and encrypted volumes after adding Swift and Cinder Backup to my DevStack.

When using openstack volume backup create on a volume that uses a volume type with LUKS encryption:

the encryption is inherited from the source volume since the encrypted LUKS blocks are simply copied over into the backup store (e.g. Swift)
the Barbican secret containing the LUKS encryption key is cloned and the clone's ID is then referenced as encryption_key_id in the backup's metadata

This is mostly identical to the behavior of openstack image create --volume in regards to the handling of key and encryption.

berendt commented 6 months ago

I was thinking more about what happens if you use the backup for a migration to another cloud or have a DR case where you have to restore everything. In that case it makes sense IMO to back up the user data as well, even if it can't be modified. The user data itself is created by some external tool (e.g. Terraform).

markus-hentsch commented 6 months ago

I was thinking more about what happens if you use the backup for a migration to another cloud or have a DR case where you have to restore everything. In that case it makes sense IMO to back up the user data as well, even if it can't be modified. The user data itself is created by some external tool (e.g. Terraform).

I had a look: this is stored in the OS-EXT-SRV-ATTR:user_data field of openstack server show. However, it is only visible to admins, as the API documentation^1 states:

The user_data the instance was created with. By default, it appears in the response for administrative users only.

I have verified this and indeed the field is shown as empty if the API call is made as a normal user (even if it was the creator of the server) and only shows up when authenticated as admin.

This is one thing we could change at the CSP side of things by creating a standard/decision that CSPs have to adjust their Nova API policy to make this field visible to users in the project.

markus-hentsch commented 6 months ago

This is one thing we could change at the CSP side of things by creating a standard/decision that CSPs have to adjust their Nova API policy to make this field visible to users in the project.

It doesn't seem to be that easy because the policy file is not fine-grained enough: the visibility of all OS-EXT-SRV-ATTR:* metadata attributes is controlled by a single policy rule^1. This means exposing many more attributes than just the user_data (including the compute host identity) to the user, which is most likely not desired by the CSP.

I don't think we can offer any means of retrieving the originally supplied user_data to customers with the current Compute API.

markus-hentsch commented 5 months ago

Cinder Backup problems with encrypted volume backups

While researching and testing proper instructions for handling Cinder volume backups, I stumbled upon some issues related to encrypted volumes.

Things get messy when the volume from which a backup is created uses a non-default encrypted volume type. Consider the following scenario:

# create an encrypted (non-default) volume type
openstack volume type create \
--property volume_backend_name='lvmdriver-1' \
--encryption-provider luks \
--encryption-cipher aes-xts-plain64 \
--encryption-key-size 256 \
--encryption-control-location front-end \
lvmdriver-1-LUKS

# create encrypted volume
openstack volume create --size 2 --image "cirros-0.6.2-x86_64-disk" --type lvmdriver-1-LUKS encrypted-volume

# create backup from encrypted volume
openstack volume backup create --name volume-backup encrypted-volume

# restore backup into new volume
openstack volume backup restore volume-backup new-volume-restored-from-backup

# check the status of the volume
openstack volume show new-volume-restored-from-backup -f value -c status

    error_restoring

When inspecting the log output of the Cinder Backup service, the following log message can be seen:

cinder.exception.EncryptedBackupOperationFailed: The source volume
type 'a935d148-0d0f-4c25-8459-669f77871c92' is different than the
destination volume type '165c9c5e-477a-4a35-965b-58e504ff4ae3'.

The openstack volume backup restore command seems to lack a parameter for specifying a volume type. The same goes for the /v3/{project_id}/backups/{backup_id}/restore API. Thus, it is only possible to restore such a backup by creating an empty and sufficiently sized volume beforehand and then force-restoring onto it:

openstack volume create --size 2 --type lvmdriver-1-LUKS empty-volume
openstack volume backup restore --force volume-backup empty-volume

Now consider what happens when the volumes and the volume type are subsequently deleted:

openstack volume delete empty-volume encrypted-volume

# switch to admin and delete the volume type (DevStack example)
source openrc admin admin
openstack volume type delete lvmdriver-1-LUKS

Now the volume type that the backup was originally based on has been deleted. This renders the backup unusable because no new volume can have the matching volume type ID.

In summary, this poses the following problems:

If the source volume type of a volume from which a volume backup is created is an encrypted non-default volume type, restoring the backup will fail unless a volume with the exact type is created beforehand and overwritten by the backup. Users are unable to specify a volume type in openstack volume backup restore (and the corresponding API) and cannot restore the backup on a new (yet-to-be-created) volume.
Even if users were able to specify a volume type, the source volume type is not shown in openstack volume backup show or the corresponding API. A user cannot determine the correct volume type based on the backup resource alone. Only the Cinder Backup service log file contains the mismatching type IDs as shown in the quoted error message above.
The existence of a backup based on the encrypted type does not prevent the deletion of the volume type in Cinder, unlike the existence of volumes of that type. Deletion of the volume type effectively renders the backup unusable.

Here is the code part in Cinder Backup that compares the volume type IDs:

https://github.com/openstack/cinder/blob/d46e2ebbd719b01aa3497853332fda1b724c281d/cinder/backup/driver.py#L204-L213
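
To illustrate why the restore fails, the effect of that comparison can be modeled as below. This is a simplified stand-in written for this thread, not the actual Cinder code; the exception class and function name are hypothetical, and only the message wording follows the log output quoted above.

```python
class VolumeTypeMismatch(Exception):
    """Hypothetical stand-in for cinder.exception.EncryptedBackupOperationFailed."""


def check_encrypted_restore(source_type_id: str, dest_type_id: str) -> None:
    """Reject restoring an encrypted backup onto a volume of another type."""
    if source_type_id != dest_type_id:
        raise VolumeTypeMismatch(
            f"The source volume type '{source_type_id}' is different "
            f"than the destination volume type '{dest_type_id}'."
        )
```

Because the comparison is made on type IDs, re-creating a volume type with the same name after deletion would still not satisfy it, since the new type gets a new ID.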

markus-hentsch commented 5 months ago

I reported the issues identified in https://github.com/SovereignCloudStack/standards/issues/541#issuecomment-2056574330 as https://bugs.launchpad.net/cinder/+bug/2061458 upstream.

markus-hentsch commented 5 months ago

For the CSP side of things, I drafted a standard at #567 to make Cinder volume backup mandatory. This ties in with what I wrote down in the user guide docs PR in regards to how to use the functionality.

Beyond that I don't feel confident to create any other CSP-side standards on the IaaS user data backup topic and I think #527 is a better place for a holistic approach to CSP-side backups in general.

Combining the availability of the volume backup functionality as per #567 and the user guide of SovereignCloudStack/docs#176 should give users the basic tools needed to create backups of their IaaS resources' data if necessary.
