canonical / microcloud

Automated private cloud based on LXD, Ceph and OVN
https://microcloud.is
GNU Affero General Public License v3.0

ceph: Add support for full disk encryption of disks deployed as part of the distributed storage #308

Open gabrielmougard opened 1 month ago

gabrielmougard commented 1 month ago

JIRA ticket: https://warthogs.atlassian.net/browse/LXD-910

We would like to be able to configure Ceph full disk encryption (FDE) as part of a MicroCloud deployment. The existing MicroCeph API already allows passing an Encrypt boolean field to encrypt data at rest at the disk level through the use of LUKS. Each generated keyring for the associated disk's OSD is stored in the internal Ceph configuration folder.

We propose to introduce a new --encrypt flag for the microcloud init | add commands to encrypt all the disks selected to be part of the Ceph deployment. We also introduce a way to select which disks to encrypt, just like what is currently done for the disk wipe operation.
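
As a rough illustration of the proposed UX (the exact flag handling and prompts may change as the PR evolves):

    # Encrypt every disk selected for distributed storage during bootstrap:
    microcloud init --encrypt

    # Same behaviour when growing an existing MicroCloud:
    microcloud add --encrypt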

One important thing to note is that Ceph will use dm_crypt under the hood to create a virtual encrypted device for each selected disk. If choosing to encrypt a disk on a given node, the user must ensure that the dm_crypt kernel module is available (it is by default for LXD ubuntu:24.04 VM images: grep -i dm_crypt config-6.8.0-31-generic should output CONFIG_DM_CRYPT=m).
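
A minimal pre-flight check per node could look like the sketch below; the config file path assumes a stock Ubuntu kernel and is only illustrative:

    # Confirm the kernel ships dm-crypt (=m means built as a module, =y means built in):
    grep -i dm_crypt "/boot/config-$(uname -r)"

    # If built as a module, make sure it can be loaded before selecting disks for encryption:
    sudo modprobe dm_crypt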

TODO

gabrielmougard commented 1 month ago

@ru-fu updated

gabrielmougard commented 1 month ago

@simondeziel the issue shown here looks similar to the one we saw in Madrid (same failure in the test case) when we discovered a potential race condition in the MicroCloud test suite. But this time, this is part of the preseed test, so I don't see why it could be related to a race. As for the interactive test failing, this is another GH runner timeout in the interactive combination, so I don't think we can do much here...

simondeziel commented 1 month ago

But this time, this is part of the preseed test so I don't see why it could be related to a race.

@masnax, does the preseed case actually side-step the disk selection/table creation race?

masnax commented 1 month ago

But this time, this is part of the preseed test so I don't see why it could be related to a race.

@masnax, does the preseed case actually side-step the disk selection/table creation race?

Yes, it doesn't use the tables at all

gabrielmougard commented 1 month ago

@simondeziel @masnax tests are ok.

masnax commented 1 month ago

Just a general design question, not sure if it makes sense at all:

Is it reasonable to expect per-disk encryption? Since the data is replicated should the encryption be all-or-nothing and not per-disk?

gabrielmougard commented 1 month ago

Just a general design question, not sure if it makes sense at all:

Is it reasonable to expect per-disk encryption? Since the data is replicated should the encryption be all-or-nothing and not per-disk?

The MicroCeph API seems to only work on a per-disk basis (see here). But if I'm following you, do you mean that, among the disks chosen for distributed storage, if we choose to enable encryption, we enforce that either all the disks are encrypted or none at all?

masnax commented 1 month ago

Just a general design question, not sure if it makes sense at all: Is it reasonable to expect per-disk encryption? Since the data is replicated should the encryption be all-or-nothing and not per-disk?

The MicroCeph API seems to only work on a per-disk basis (see here). But if I'm following you, do you mean that, among the disks chosen for distributed storage, if we choose to enable encryption, we enforce that either all the disks are encrypted or none at all?

Yes, I meant just having a yes/no question for disk encryption that sets the field on every disk payload that we send to the MicroCeph API.

gabrielmougard commented 1 month ago

@simondeziel in the 'basic' test suite (test_instances_config) I get an Error: Failed to run: zpool create -f -m none -O compression=on local /dev/sdb: exit status 1 (/dev/sdb is in use and contains a unknown filesystem.). It seems that somehow a device has not been properly wiped. Do you know where it could come from?

simondeziel commented 1 month ago

@simondeziel in the 'basic' test suite (test_instances_config) I get an Error: Failed to run: zpool create -f -m none -O compression=on local /dev/sdb: exit status 1 (/dev/sdb is in use and contains a unknown filesystem.). It seems that somehow a device has not been properly wiped. Do you know where it could come from?

IIRC, @masnax ran into this multiple times and a theory was that LXD didn't properly wipe the disks sometimes.

The "is in use" here is a bit more worrying than not being properly wiped though.

tomponline commented 1 month ago

IIRC, @masnax ran into this multiple times and a theory was that LXD didn't properly wipe the disks sometimes.

The "is in use" here is a bit more worrying than not being properly wiped though.

I believe he originally thought that but then found it was a different issue.

masnax commented 1 month ago

I believe he originally thought that but then found it was a different issue.

No, it's the same issue, which is that zpool destroy local can sometimes silently fail to destroy the zpool. I believe the solution we came to was to update LXD to attempt zpool destroy -f local as a fallback, but I haven't actually made a PR for that yet.

I've submitted a PR for it now: https://github.com/canonical/lxd/pull/13513

tomponline commented 1 month ago

No, it's the same issue, which is that zpool destroy local can sometimes silently fail to destroy the zpool. I believe the solution we came to was to update LXD to attempt zpool destroy -f local as a fallback, but I haven't actually made a PR for that yet.

I've submitted a PR for it now: canonical/lxd#13513

I'd like to know more about the scenario and when it occurs before we potentially paper over a separate issue with "-f" incorrectly.

masnax commented 1 month ago

I'd like to know more about the scenario and when it occurs before we potentially paper over a separate issue with "-f" incorrectly.

As discussed here: https://github.com/canonical/microcloud/pull/300#discussion_r1589304021

zpool destroy <pool> will sometimes return successfully but fail to actually do anything. In such cases, zpool destroy -f <pool> will correctly remove the pool.
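
For reference, a minimal sketch of that fallback logic, assuming the pool is named local as in the failing test:

    # First try a normal destroy; it can return 0 while the pool survives.
    zpool destroy local
    # If the pool is still there, retry with -f (the fallback described above).
    if zpool list local >/dev/null 2>&1; then
        zpool destroy -f local
    fi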

tomponline commented 1 month ago

As discussed here: #300 (comment)

zpool destroy <pool> will sometimes return successfully but fail to actually do anything. In such cases, zpool destroy -f <pool> will correctly remove the pool.

Ah ok so we should expect this to reveal the underlying issue rather than fix it. Got you.

masnax commented 1 month ago

Ah ok so we should expect this to reveal the underlying issue rather than fix it. Got you.

Well, not really. It's definitely just hand-waving the issue away, because we are successfully removing the pool with --force. As for why that works, and why not using --force can return without an error while leaving the pool in place, that's a mystery to me.

simondeziel commented 1 month ago

I've submitted a PR for it now: canonical/lxd#13513

I'd like to know more about the scenario and when it occurs before we potentially paper over a separate issue with "-f" incorrectly.

I'd be curious to see if there is anything of interest in zpool events -v <pool> when the first zpool destroy <pool> command fails.
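
Something along these lines could capture that information the next time the destroy misbehaves (pool name local assumed):

    # Run right after a zpool destroy that "succeeded" but left the pool behind:
    zpool list local >/dev/null 2>&1 && zpool events -v local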

tomponline commented 1 month ago

As for why that works, and why not using --force can return without an error while leaving the pool in place, that's a mystery to me.

This is why I'm hesitant to merge. It doesn't seem like we've got a good enough handle on what's happening yet to make an informed decision either way. You can't forcibly unmount in Linux (only defer it to the background).

masnax commented 1 month ago

@simondeziel And here's what I get when I grep for destroy:

May 28 2024 21:19:02.108750009 sysevent.fs.zfs.pool_destroy
        version = 0x0
        class = "sysevent.fs.zfs.pool_destroy"
--
        pool_state = 0x0
        pool_context = 0x6
        history_hostname = "micro01"
        history_internal_str = "feature@async_destroy=enabled"
--
        history_hostname = "micro01"
        history_dsname = "local/virtual-machines/v1.block"
        history_internal_str = "(bptree, mintxg=1)"
        history_internal_name = "destroy"
--
        history_hostname = "micro01"
        history_dsname = "local/virtual-machines/v1"
        history_internal_str = "(bptree, mintxg=1)"
        history_internal_name = "destroy"

EDIT:

Nevermind, this was a separate issue with the non-snapshot tests. I was hoping I could get a quick reproducer, but not quite yet :(

simondeziel commented 1 month ago

"feature@async_destroy=enabled"

This seems like a feature we'd probably want to disable prior to doing the zpool destroy from LXD. Presumably this would get us the error synchronously.
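
If ZFS does not allow flipping that feature back to disabled on an existing pool (pool features generally cannot be disabled once enabled), the property would need to be set when the pool is created, roughly like this sketch:

    # Hypothetical pool creation with async_destroy left disabled,
    # so destroy errors would surface synchronously:
    zpool create -f -m none -O compression=on \
        -o feature@async_destroy=disabled \
        local /dev/sdb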

gabrielmougard commented 1 week ago

@simondeziel do you know why we see an Error: Failed instance creation: Failed creating instance from image: Failed to activate volume "f06aa34c14c9863617b1b2b2206bb62ffb085cd9027dc966103565fa6a10b10c" in the test setup phase here?

simondeziel commented 1 week ago

@simondeziel do you know why we see an Error: Failed instance creation: Failed creating instance from image: Failed to activate volume "f06aa34c14c9863617b1b2b2206bb62ffb085cd9027dc966103565fa6a10b10c" in the test setup phase here?

I believe that's a known (recent) regression that @MusicDin is already working on fixing in https://github.com/canonical/lxd/pull/13662

gabrielmougard commented 1 week ago

@MusicDin @simondeziel are the failing tests here related to https://github.com/canonical/lxd/pull/13662? We actually don't have much information on why the VMs are failing to start here :/ Is it another regression observed in LXD?

MusicDin commented 1 week ago

The regression in ZFS volume activation should be fixed now. Am I interpreting this correctly that the tests are failing because the expected status is READY but the status reported by LXD is RUNNING?

  + for round in $(seq 1 5 150)
  ++ lxc list -f csv -c s micro01
  + '[' RUNNING = READY ']'
  + '[' 26 = 0 ']'
  + sleep 5
  + echo '    micro01 VM failed to start'
      micro01 VM failed to start
tomponline commented 1 week ago

The regression in ZFS volume activation should be fixed now. Am I interpreting this correctly that the tests are failing because the expected status is READY but the status reported by LXD is RUNNING?

  + for round in $(seq 1 5 150)
  ++ lxc list -f csv -c s micro01
  + '[' RUNNING = READY ']'
  + '[' 26 = 0 ']'
  + sleep 5
  + echo '    micro01 VM failed to start'
      micro01 VM failed to start

Should be fixed now in latest/edge https://launchpad.net/~canonical-lxd/lxd/+snap/lxd-latest-edge

tomponline commented 1 week ago

wait i take that back...

tomponline commented 1 week ago

wait i take that back...

No, it should be fine in latest/edge, but not in 5.21/edge or 5.0/edge yet.

MusicDin commented 1 week ago

No, it should be fine in latest/edge, but not in 5.21/edge or 5.0/edge yet.

I may be wrong, but this does not seem to be the volume activation issue, because at the end of the test it reports that the instance is running and it also gets the IP addresses. Yesterday it was failing with something similar to Failed to activate volume ..

 ++ lxc list -f csv -c s micro01
  + '[' RUNNING = READY ']'
  + '[' 26 = 0 ']'
  + sleep 5
  + echo '    micro01 VM failed to start'
      micro01 VM failed to start
  + return 1
  + cleanup
  + set -eux
  + lxc remote switch local
  + lxc project switch microcloud-test
  + set +e
  + '[' setup = setup ']'
  + '[' failure = success ']'
  + '[' -n '' ']'
  + lxc list --all-projects
  +-----------------+---------+---------+-----------------------+------------------------------------------------+-----------------+-----------+
  |     PROJECT     |  NAME   |  STATE  |         IPV4          |                      IPV6                      |      TYPE       | SNAPSHOTS |
  +-----------------+---------+---------+-----------------------+------------------------------------------------+-----------------+-----------+
  | microcloud-test | micro01 | RUNNING | 10.57.120.50 (enp5s0) | fd42:51b:91c0:6cc9:216:3eff:fe65:4879 (enp5s0) | VIRTUAL-MACHINE | 0         |
  +-----------------+---------+---------+-----------------------+------------------------------------------------+-----------------+-----------+

But the test is expecting READY state, not RUNNING? https://github.com/canonical/microcloud/blob/main/test/includes/microcloud.sh#L1108-L1131

simondeziel commented 1 week ago

I may be wrong, but this does not seem to be the volume activation issue, because at the end of the test it reports that the instance is running and it also gets the IP addresses. Yesterday it was failing with something similar to Failed to activate volume ..

Yep, the ZFS activation issue is no longer a blocker (thanks), and it seems like the add test failed due to a timeout while waiting for the READY state, probably due to a slower-than-normal runner. I'll try to switch to the bigger runners now if that's quick.

gabrielmougard commented 1 week ago

@tomponline @simondeziel after SSHing into the runner with tmate (thanks @tomponline for the tip), I noticed that the issue of the cloud-init script not setting the VM to the READY state was happening somewhat randomly (I created many VMs with the same cloud-init profile; some stayed RUNNING forever, and only a few reached READY).

I noticed that setting limits.cpu=4 instead of 2 dramatically increased the success rate though. I also wrapped the lxd_create / lxd_wait logic in a short retry loop to eliminate the outliers... It seems to work so far.
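
Roughly, the retry wrapper looks like the sketch below; the attempt count and the assumption that lxd_create and lxd_wait take no arguments are mine, and the real helpers presumably live in test/includes/microcloud.sh:

    # Retry VM creation a few times to absorb slow-runner outliers (3 attempts assumed):
    for attempt in 1 2 3; do
        if lxd_create && lxd_wait; then
            break
        fi
        echo "    VM setup attempt ${attempt} failed, retrying"
    done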

simondeziel commented 1 week ago

I noticed that setting limits.cpu=4 instead of 2 dramatically increased the success rate though. I also wrapped the lxd_create / lxd_wait logic in a short retry loop to eliminate the outliers... It seems to work so far.

I'm fine with that, especially since we are starting VMs serially (CONCURRENT_SETUP=0); this might cause issues for those running them concurrently, like @masnax does locally, IIRC.

masnax commented 1 week ago

I noticed that setting limits.cpu=4 instead of 2 dramatically increased the success rate though. I also wrapped the lxd_create / lxd_wait logic in a short retry loop to eliminate the outliers... It seems to work so far.

I'm fine with that, especially since we are starting VMs serially (CONCURRENT_SETUP=0); this might cause issues for those running them concurrently, like @masnax does locally, IIRC.

No issue from me, but I hope the 3-node tests continue to be happy with the over-commitment.

masnax commented 6 days ago

@gabrielmougard looks like the test package needs a rebase now.

masnax commented 5 days ago

@gabrielmougard Another rebase, sorry :)

masnax commented 4 days ago

@gabrielmougard You need to remove the systemd-journald related masks from microcloud.sh as that seems to be what's breaking cloud-init.

simondeziel commented 4 days ago

@gabrielmougard You need to remove the systemd-journald related masks from microcloud.sh as that seems to be what's breaking cloud-init.

Indeed, it looks like it was lost in the merge conflict resolution. There should only be one masked systemd-journald unit, and that's the systemd-journal-flush.service one. The other two should be gone.
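
In other words, something like the following should be the only journald-related mask left in microcloud.sh (the exact invocation in the script may differ, e.g. it may run inside the test VMs):

    # The single mask that should remain; the other two journald masks should be dropped.
    systemctl mask systemd-journal-flush.service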