Open gabrielmougard opened 1 month ago
@ru-fu updated
@simondeziel the issue shown here looks similar to the one we saw in Madrid (same failure in the test case) when we discovered a potential race condition in the MicroCloud test suite. But this time, this is part of the preseed test so I don't see why it could be related to a race. As for the interactive test failing, this is another GH runner timeout in the interactive combination so I don't think we can do much here...
But this time, this is part of the preseed test so I don't see why it could be related to a race.
@masnax, does the preseed case actually side-step the disk selection/table creation race?
Yes, it doesn't use the tables at all
@simondeziel @masnax tests are ok.
Just a general design question, not sure if it makes sense at all:
Is it reasonable to expect per-disk encryption? Since the data is replicated should the encryption be all-or-nothing and not per-disk?
The MicroCeph API seems to only work on a per-disk basis (see here). But if I'm following you, do you mean that, among the disks chosen for distributed storage, if we choose to enable encryption, we enforce that either all the disks are encrypted or none at all?
Yes I meant just having a yes/no question for disk encryption that sets the field for every disk payload that we send to the MicroCeph API.
@simondeziel in the 'basic' test suite (test_instances_config) I get Error: Failed to run: zpool create -f -m none -O compression=on local /dev/sdb: exit status 1 (/dev/sdb is in use and contains a unknown filesystem.)
.. it seems that somehow a device has not been properly wiped. Do you know where it could come from?
IIRC, @masnax ran into this multiple times and a theory was that LXD didn't properly wipe the disks sometimes.
The "is in use" here is a bit more worrying than not being properly wiped though.
I believe he originally thought that but then found it was a different issue
No, it's the same issue, which is that zpool destroy local can sometimes silently fail to destroy the zpool. I believe the solution we came to was to update LXD to attempt zpool destroy -f local as a fallback, but I haven't actually made a PR for that yet.
I've submitted a PR for it now: https://github.com/canonical/lxd/pull/13513
I'd like to know more about the scenario and when it occurs before we potentially paper over a separate issue with "-f" incorrectly.
As discussed here: https://github.com/canonical/microcloud/pull/300#discussion_r1589304021
zpool destroy <pool> will sometimes return successfully but fail to actually do anything. In such cases, zpool destroy -f <pool> will correctly remove the pool.
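The fallback described above can be sketched as a small shell helper (a hedged sketch, not the actual LXD code: the destroy_zpool name is made up). Since a plain zpool destroy can return 0 while leaving the pool in place, the exit code alone is not trusted; the pool's continued existence is re-checked before falling back to a forced destroy.

```shell
# Sketch of the discussed fallback: plain destroy first, then verify,
# then force-destroy only if the pool is still present.
destroy_zpool() {
    pool="$1"

    # A plain destroy can return 0 yet leave the pool behind, so the
    # exit code alone is not trusted here.
    zpool destroy "$pool" 2>/dev/null || true

    # If the pool still shows up, fall back to a forced destroy.
    if zpool list "$pool" >/dev/null 2>&1; then
        zpool destroy -f "$pool"
    fi
}
```

Gating the -f on a failed re-check keeps the force path from masking other errors in the common case where the plain destroy actually worked.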
Ah ok so we should expect this to reveal the underlying issue rather than fix it. Got you.
Well, not really. It's definitely just hand-waving the issue, because we are successfully removing the pool with --force. As for why that works, and why not using --force can return without an error but with the pool still existing, is a mystery to me.
I'd be curious to see if there is anything of interest in zpool events -v <pool> when the first zpool destroy <pool> command fails.
This is why I'm hesitant to merge. Doesn't seem like we've got a good enough handle on what's happening yet to make an informed decision either way. You can't forcibly unmount in Linux (only defer in the background).
@simondeziel And here's what I get when I grep for destroy:
May 28 2024 21:19:02.108750009 sysevent.fs.zfs.pool_destroy
version = 0x0
class = "sysevent.fs.zfs.pool_destroy"
--
pool_state = 0x0
pool_context = 0x6
history_hostname = "micro01"
history_internal_str = "feature@async_destroy=enabled"
--
history_hostname = "micro01"
history_dsname = "local/virtual-machines/v1.block"
history_internal_str = "(bptree, mintxg=1)"
history_internal_name = "destroy"
--
history_hostname = "micro01"
history_dsname = "local/virtual-machines/v1"
history_internal_str = "(bptree, mintxg=1)"
history_internal_name = "destroy"
EDIT: Nevermind, this was a separate issue with the non-snapshot tests. I was hoping I could get a quick reproducer, but not quite yet :(
"feature@async_destroy=enabled"
This seems like a feature we'd probably want to disable prior to doing the zpool destroy from LXD. Presumably this would get us the error synchronously.
@simondeziel do you know why we see Error: Failed instance creation: Failed creating instance from image: Failed to activate volume "f06aa34c14c9863617b1b2b2206bb62ffb085cd9027dc966103565fa6a10b10c" in the test setup phase here?
I believe that's a known (recent) regression that @MusicDin is already working on fixing in https://github.com/canonical/lxd/pull/13662
@MusicDin @simondeziel are the failing tests here related to https://github.com/canonical/lxd/pull/13662 ? We actually don't have much information on why the VMs are failing to start here :/ Is it another regression observed in LXD?
The regression in zfs volume activation should be fixed now. Am I properly interpreting that the tests are failing because the expected status is READY but the reported status (by LXD) is RUNNING?
+ for round in $(seq 1 5 150)
++ lxc list -f csv -c s micro01
+ '[' RUNNING = READY ']'
+ '[' 26 = 0 ']'
+ sleep 5
+ echo ' micro01 VM failed to start'
micro01 VM failed to start
Should be fixed now in latest/edge https://launchpad.net/~canonical-lxd/lxd/+snap/lxd-latest-edge
Wait, I take that back...
No, it should be fine in latest/edge, but not in 5.21/edge or 5.0/edge yet
I may be wrong, but this does not seem to be the volume activation issue, because at the end of the test it reports that the instance is running, and it also gets the IP addresses. Yesterday it was failing with something similar to Failed to activate volume ..
++ lxc list -f csv -c s micro01
+ '[' RUNNING = READY ']'
+ '[' 26 = 0 ']'
+ sleep 5
+ echo ' micro01 VM failed to start'
micro01 VM failed to start
+ return 1
+ cleanup
+ set -eux
+ lxc remote switch local
+ lxc project switch microcloud-test
+ set +e
+ '[' setup = setup ']'
+ '[' failure = success ']'
+ '[' -n '' ']'
+ lxc list --all-projects
+-----------------+---------+---------+-----------------------+------------------------------------------------+-----------------+-----------+
| PROJECT | NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+-----------------+---------+---------+-----------------------+------------------------------------------------+-----------------+-----------+
| microcloud-test | micro01 | RUNNING | 10.57.120.50 (enp5s0) | fd42:51b:91c0:6cc9:216:3eff:fe65:4879 (enp5s0) | VIRTUAL-MACHINE | 0 |
+-----------------+---------+---------+-----------------------+------------------------------------------------+-----------------+-----------+
But the test is expecting the READY state, not RUNNING?
https://github.com/canonical/microcloud/blob/main/test/includes/microcloud.sh#L1108-L1131
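For context, the linked wait loop boils down to something like the following sketch (the wait_ready name and retry budget are illustrative, not the actual microcloud.sh code): the status column is polled until the instance reports READY, which it only does once the guest itself signals readiness (e.g. from cloud-init), so a VM can sit in RUNNING indefinitely if that signal never arrives.

```shell
# Illustrative READY wait loop: poll the LXD status column for the
# named instance until it reports READY or the retries run out.
wait_ready() {
    name="$1"
    for _ in $(seq 1 30); do
        state="$(lxc list -f csv -c s "$name")"
        if [ "$state" = "READY" ]; then
            return 0
        fi
        sleep 5
    done
    echo "${name} VM failed to reach READY state" >&2
    return 1
}
```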
Yep, the ZFS activation issue is no longer a blocker (thanks) and it seems like the add test failed due to a timeout reaching the READY state, probably due to a slower than normal runner. I'll try to switch to the bigger runners now if that's quick.
@tomponline @simondeziel after sshing into the runner with tmate (thanks @tomponline for the tip), I noticed that the issue of the cloud-init script not setting the VM to a READY state was happening somewhat randomly (I created many VMs with the same cloud-init profile; some stayed RUNNING, some reached READY, but those were just a few).
I noticed that setting limits.cpu=4 instead of 2 dramatically increased the success rate though. I also wrapped the lxd_create / lxd_wait logic in a short retry loop to eliminate the outliers... It seems to work so far.
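The retry wrapper mentioned above could look roughly like this (a hedged sketch: lxd_create and lxd_wait are the MicroCloud test-suite helpers referenced in the comment, while the create_with_retry name and the retry count of 3 are made up for illustration):

```shell
# Illustrative retry wrapper: recreate the VM from scratch a few
# times if it never reaches the READY state.
create_with_retry() {
    name="$1"
    for attempt in 1 2 3; do
        if lxd_create "$name" && lxd_wait "$name"; then
            return 0
        fi
        echo "attempt ${attempt} failed for ${name}, retrying" >&2
        # Tear the half-started VM down before the next attempt.
        lxc delete -f "$name" 2>/dev/null || true
    done
    return 1
}
```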
I'm fine with that, especially since we are starting VMs serially (CONCURRENT_SETUP=0); this might cause issues for those running them concurrently like @masnax does locally IIRC.
No issue from me, but I hope the 3-node tests continue to be happy with the over-commitment.
@gabrielmougard looks like the test package needs a rebase now.
@gabrielmougard Another rebase, sorry :)
@gabrielmougard You need to remove the systemd-journald related masks from microcloud.sh as that seems to be what's breaking cloud-init.
Indeed, looks like it was lost in the merge conflict resolution. There should only be one masked systemd-journald unit, and that's the systemd-journal-flush.service one. The other 2 should be gone.
JIRA ticket: https://warthogs.atlassian.net/browse/LXD-910
We would like to be able to configure Ceph full disk encryption (FDE) as part of a MicroCloud deployment. The existing MicroCeph API already allows passing an Encrypt boolean field to encrypt data at rest at the disk level through the use of LUKS. Each generated keyring for the associated disk OSD is stored in the internal Ceph configuration folder.

We propose to introduce a new --encrypt flag for the microcloud init | add commands to encrypt all the disks selected to be part of the Ceph deployment. We also introduce a way to select which disks to encrypt, just like it is currently done for the disk wipe operation.

One important thing to note is that Ceph will use dm_crypt under the hood to create a virtual encrypted device for each selected disk device. If choosing to encrypt a disk on a certain node, the user must ensure that the dm_crypt kernel module is enabled (it is enabled by default for LXD ubuntu:24.04 VM images: grep -i dm_crypt config-6.8.0-31-generic should output CONFIG_DM_CRYPT=m).

TODO
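The dm_crypt availability check described above can be sketched as a small shell helper (a hedged sketch: the check_dm_crypt name is made up, the kernel config path is the standard Ubuntu location, and the modprobe dry-run fallback is an extra precaution for systems without a /boot config file):

```shell
# Illustrative dm_crypt availability check. dm_crypt can be built in
# (=y), built as a module (=m), or absent from the kernel config.
check_dm_crypt() {
    config="/boot/config-$(uname -r)"
    if grep -q '^CONFIG_DM_CRYPT=[my]' "$config" 2>/dev/null; then
        echo "dm_crypt available"
        return 0
    fi
    # Fallback: ask modprobe whether the module could be loaded
    # (dry run, nothing is actually loaded).
    if modprobe -n dm_crypt 2>/dev/null; then
        echo "dm_crypt available"
        return 0
    fi
    echo "dm_crypt missing" >&2
    return 1
}
```

On the ubuntu:24.04 images mentioned above, the first branch should already succeed since the config ships CONFIG_DM_CRYPT=m.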