canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Unable to start instances on zfs backed LXD if the pool reports as DEGRADED #13641

Open webdock-io opened 1 week ago

webdock-io commented 1 week ago

LXD v5.21.1 LTS on Ubuntu Noble

# lxc start foo
Error: Unable to find backing block for zfs pool lxd

If zpool status reports your pool as anything but "ONLINE", LXD fails with this error. This does not seem to have been the case in the past: I'm sure we've had degraded pools before without seeing this. Looking at the LXD source, it seems you are explicitly matching the string "ONLINE" against the output of zpool status and hard-failing if it is not found.

This is incorrect behavior. A degraded pool is not necessarily non-functional, and its state alone is no reason for LXD to refuse to proceed.

tomponline commented 1 week ago

According to https://docs.oracle.com/cd/E19253-01/819-5461/gbcve/

The state can be one of the following: ONLINE, FAULTED, DEGRADED, UNAVAIL, or OFFLINE. If the state is anything but ONLINE, the fault tolerance of the pool has been compromised.

The offending error is from here:

https://github.com/canonical/lxd/blob/f9f88f4e77ae2746f9a7ae004b89b3c5003cae6d/lxd/device/disk.go#L2323

And as OP says, is caused by this filter for ONLINE:

https://github.com/canonical/lxd/blob/f9f88f4e77ae2746f9a7ae004b89b3c5003cae6d/lxd/device/disk.go#L2287-L2289

@simondeziel, this function is used to get the parent block devices so that the shared disk limits to apply can be calculated.

In this case I think it's safe to allow degraded pools to still be considered as parents for disk limits.

What do you think? Are there any other states we should also consider?

simondeziel commented 1 week ago

> In this case I think it's safe to allow degraded pools to still be considered as parents for disk limits.

Agreed.

> What do you think? Are there any other states we should also consider?

Looking at the other possible states on https://openzfs.github.io/openzfs-docs/man/master/7/zpoolconcepts.7.html#Device_Failure_and_Recovery, I think you are right that only ONLINE and DEGRADED should be considered OK.
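For illustration, here is a minimal sketch of the relaxed filter discussed above. It is not the code in lxd/device/disk.go: the names usableStates and backingDevices are hypothetical, and it assumes the input is the text printed by zpool status -P -L, where each config row has the columns NAME STATE READ WRITE CKSUM and leaf devices appear as absolute paths.

```go
package main

import "strings"

// usableStates is a hypothetical allow-list for this sketch: the two
// states the thread agrees should still count as usable.
var usableStates = map[string]bool{
	"ONLINE":   true,
	"DEGRADED": true,
}

// backingDevices scans text in the shape of `zpool status -P -L` output
// and returns the leaf device paths whose STATE column is in usableStates.
// This is an illustrative helper, not LXD's actual implementation.
func backingDevices(status string) []string {
	var devices []string

	for _, line := range strings.Split(status, "\n") {
		fields := strings.Fields(line)

		// Config rows carry at least NAME STATE READ WRITE CKSUM.
		if len(fields) < 5 {
			continue
		}

		name, state := fields[0], fields[1]

		// Skips the column header row as well as FAULTED, OFFLINE,
		// REMOVED and UNAVAIL devices.
		if !usableStates[state] {
			continue
		}

		// Assumption: with -P, leaf devices are printed as absolute
		// paths, while the pool row and grouping rows such as
		// "mirror-0" are not.
		if !strings.HasPrefix(name, "/") {
			continue
		}

		devices = append(devices, name)
	}

	return devices
}
```

With this filter, a pool whose top-level row reads DEGRADED still yields its ONLINE and DEGRADED leaf devices, so the caller only ends up with an empty list (and the "Unable to find backing block" style of failure) when no usable device remains.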