canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

BTRFS quota is reached when filling up VM disk image file #9124

Closed by tomponline 2 years ago

tomponline commented 3 years ago

Even though LXD sets the BTRFS quota 100MiB larger (by default) than the VM disk image file's maximum size, if the VM uses all of its disk space the underlying BTRFS filesystem reports that the referenced disk quota has been reached. This prevents LXD from starting the VM because it cannot write the backup file, even though there should still be some space free.

It seems like BTRFS quota isn't working the way we think it does.

See https://discuss.linuxcontainers.org/t/btrfs-issues-storage-pools-btrfs-empty-and-btrfs-quota-100-while-inside-the-vm-only-48-utilized/11897

Steps to reproduce:

lxc storage create btrfs btrfs
lxc init images:ubuntu/focal/cloud v1 --vm -s btrfs

# This should do two things: increase the disk image file size to 11GB, and set the BTRFS disk quota to 11GB + 100MiB (100MiB being the default `volume.state` size), accounting for the entirety of the disk size and allowing 100MiB for volume state usage.
lxc config device set v1 root size=11GB # Accounting for VM image file size sizeBytes=11104862208 blockSize=11000004608

Check the BTRFS quota that was set (expect it to be blockSize 11000004608 + 100MiB (104857600 bytes) = 11104862208 bytes):

sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
    Name:           v1
    UUID:           5816cce0-0dba-3348-bd29-14b80dfdbc85
    Parent UUID:        06d1312f-5158-f44e-89a0-999e132ee7bc
    Received UUID:      -
    Creation time:      2021-12-01 12:27:57 +0000
    Subvolume ID:       272
    Generation:         2516
    Gen at creation:    2515
    Parent ID:      5
    Top level ID:       5
    Flags:          -
    Snapshot(s):
    Quota group:        0/272
      Limit referenced: 11104862208 # Matches requested sizeBytes of 11104862208
      Limit exclusive:  -
      Usage referenced: 2361466880
      Usage exclusive:  57344

Check size of root disk file (expect it to be 11000004608 bytes):

sudo du -b /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
11000004608 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

Start the VM:

lxc start v1
lxc shell v1

Now, inside v1, run the following until the disk fills up (this should fill up the disk image but not reach the BTRFS quota, as there is another 100MiB allowed):

cat /dev/urandom > /root/foo.bin
cat: write error: Read-only file system

You can now see that the BTRFS referenced quota has been reached, which it shouldn't have been.

sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
    Name:           v1
    UUID:           5816cce0-0dba-3348-bd29-14b80dfdbc85
    Parent UUID:        06d1312f-5158-f44e-89a0-999e132ee7bc
    Received UUID:      -
    Creation time:      2021-12-01 12:27:57 +0000
    Subvolume ID:       272
    Generation:         3246
    Gen at creation:    2515
    Parent ID:      5
    Top level ID:       5
    Flags:          -
    Snapshot(s):
    Quota group:        0/272
      Limit referenced: 11104862208
      Limit exclusive:  -
      Usage referenced: 11104841728 # Limit reached, which is hard to explain given the disk image size is still 11000004608.
      Usage exclusive:  9011888128

The disk image is still at the set size of 11000004608 bytes.

sudo du -b /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
11000004608 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

And the actual used blocks of the image are:

du  -B1 root.img 
10758160384 root.img

Indeed the total size of the volume is less than the quota:

du  -B1 
20480   ./templates
8192    ./config/systemd
4096    ./config/udev
4096    ./config/files
14815232    ./config
10773151744 .
jamielsharief commented 3 years ago

Could it be related to this https://github.com/lxc/lxd/issues/8468

stgraber commented 3 years ago

It's possible, yeah. btrfs quotas are really frustrating to work with and very odd compared to what you get on ZFS or through project quotas on ext4/xfs. For one thing, they seem somewhat asynchronous, making it possible to exceed the limit by a few hundred MBs before the quota kicks in. This may well be the source of the issue here, but we'll need to investigate some more.

tomponline commented 3 years ago

Yes, I noticed that when I looked into it initially: the quota consumption kept changing even though the actual instance wasn't running.

jamielsharief commented 3 years ago

Issue #8468 is about incorrect disk usage; as seen in the examples, it reported the usage as 9MB despite the actual usage being 500MB. If I remember correctly, creating a snapshot resets the usage.

It was another issue where I reported the delay in the information.

jamielsharief commented 3 years ago

So I am thinking that the usage calculation is somehow computing the difference between the filesystem and the most recent snapshot, as opposed to the original image.

tomponline commented 3 years ago

It's not even that simple, because if you wait some time you'll see the utilisation change on the original volume too, as BTRFS performs an async usage scan after taking the snapshot.

jamielsharief commented 3 years ago

I think I submitted that as a different issue, but my understanding was that this is just how the storage driver works, so deal with it. It's mainly noticeable to users when using a web UI for LXD; using the command line, you probably won't notice it.

tomponline commented 3 years ago

Yes, it's not ideal if the BTRFS reporting tools are async. But if it allows users to exceed their quotas due to that async nature then it's pretty nasty. And this is without even using snapshots (apart from the initial source one, that is).

There's also the concept of a referenced data limit vs an exclusive data limit, whose ramifications I've not fully understood yet.

jamielsharief commented 3 years ago

I think we are talking about two different problems. The problem I discovered was that when creating a snapshot on a custom BTRFS partition, it resets the usage to almost nothing. This means the reported usage is far lower than what is actually used, so I am not sure whether that later leads to the problem reported above.

tomponline commented 3 years ago

In this case I am talking about the example above.

jamielsharief commented 3 years ago

No problem, I just saw the issue come through and it reminded me of similar problems, which could create bugs if the usage information is used by the API for something else.

stgraber commented 3 years ago

@tomponline assigning to you so we can decide what to do with this.

If the issue is that btrfs applies quotas asynchronously, then I suspect the only thing we can do is mention it in doc/storage.md and close the issue. For those affected, setting the size.state property on the root device to something quite large like 1GiB should do the trick to let you start things back up, but it's not something that I think we should be doing for the users.
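
For anyone hitting this in the meantime, the workaround looks roughly like this (a sketch; the instance name v1 and the 1GiB value are just examples):

# Raise the state allowance so the backup.yaml write fits under the BTRFS quota,
# then start the instance again.
lxc config device set v1 root size.state=1GiB
lxc start v1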

tomponline commented 2 years ago

@stgraber looking into this now.

jamielsharief commented 2 years ago

It's been a while since I encountered this, but let me know if I can help.

tomponline commented 2 years ago

To rule out any issues with snapshots and quota accounting, I created a VM, exported it as a non-optimised backup, and then re-imported it so it wasn't linked to any existing image subvolume.

After import I set the disk size to 12GiB and then filled it up as normal.
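
Roughly that sequence, as a sketch (file names are illustrative; lxc export produces a non-optimised backup by default):

lxc export v1 v1-backup.tar.gz     # plain (non-optimised) backup
lxc delete v1
lxc import v1-backup.tar.gz        # the re-imported subvolume has no parent image snapshot
lxc config device set v1 root size=12GiB
lxc start v1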

sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
    Name:           v1
    UUID:           35c8e59c-e13e-b049-959f-66746bb10a4f
    Parent UUID:        -
    Received UUID:      -
    Creation time:      2021-12-01 14:34:18 +0000
    Subvolume ID:       307
    Generation:         28884
    Gen at creation:    6547
    Parent ID:      5
    Top level ID:       5
    Flags:          -
    Snapshot(s):
    Quota group:        0/307
      Limit referenced: 13409189888
      Limit exclusive:  -
      Usage referenced: 13409112064
      Usage exclusive:  13409112064

I checked using btrfs fi du and du -B1 and they agree on the size of the volume's root disk file:

sudo du -B1 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
12101988352 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
sudo btrfs fi du --raw  /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
     Total   Exclusive  Set shared  Filename
12101988352  12101988352           0  /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

So it seems the quota group is reporting incorrect info, as its total for the subvolume is larger than what btrfs fi du reports for the whole subvolume.

tomponline commented 2 years ago

Here's something curious: if you create an empty VM, then manually set up a loop device and a filesystem on the disk image, mount it, and fill up that filesystem, it doesn't exceed the disk image file's size and doesn't reach the BTRFS quota.

lxc init v1 --empty --vm -s btrfs
lxc config device set v1 root size=11GiB
losetup -f /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img --show
/dev/loop21
mkfs.ext4 /dev/loop21
mount /dev/loop21 /mnt
dd if=/dev/random of=/mnt/foo
dd: writing to '/mnt/foo': No space left on device
22460761+0 records in
22460760+0 records out
11499909120 bytes (11 GB, 11 GiB) copied, 189.115 s, 60.8 MB/s
sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
    Name:           v1
    UUID:           44d51cde-aa4e-874c-8bcc-344d454ec955
    Parent UUID:        -
    Received UUID:      -
    Creation time:      2021-12-01 16:24:35 +0000
    Subvolume ID:       373
    Generation:         35468
    Gen at creation:    35439
    Parent ID:      5
    Top level ID:       5
    Flags:          -
    Snapshot(s):
    Quota group:        0/373
      Limit referenced: 11916017664
      Limit exclusive:  -
      Usage referenced: 11572088832
      Usage exclusive:  11572088832

And yet doing the same dd command inside the VM will fill up the BTRFS quota.

tomponline commented 2 years ago

I even tried using losetup to create a loop device for the disk image and then manually modified LXD to use the /dev/loop21 device as its root disk rather than using the root file directly, and I got the same effect: the BTRFS quota was exceeded. It looks like some issue in the interplay between QEMU and BTRFS quotas.

jamielsharief commented 2 years ago

I think issue #8468 also covers how creating snapshots changed things as well.

stgraber commented 2 years ago

@tomponline possibly has to do with the size of the writes. You could probably try dd with something like bs=4M conv=fdatasync? I wonder if btrfs re-computes the quota on sync and not on write or something.

tomponline commented 2 years ago

@stgraber I've also tried this with a BTRFS storage pool on a raw NVMe device rather than on a loopdev, to avoid any issues with loopdev on loopdev, but I see the same thing.

It seems that I need to set size.state to 1098MiB to allow a 12GiB disk image to be filled without also causing the BTRFS quota to be reached.

tomponline commented 2 years ago

@tomponline possibly has to do with the size of the writes. You could probably try dd with something like bs=4M conv=fdatasync? I wonder if btrfs re-computes the quota on sync and not on write or something.

I did start to think perhaps it's a fragmentation issue, which could also be affected by block size. This would explain why the quota extents are larger than how BTRFS is tracking the actual used file sizes.

tomponline commented 2 years ago

Same thing, but slightly different: using a larger block size still ends up exceeding the quota, just not by as much.

First, try to fill a 12GiB disk:

root@v1:~# dd if=/dev/urandom of=/root/foo.img bs=4M conv=fdatasync
dd: error writing '/root/foo.img': Read-only file system
2591+0 records in
2590+0 records out
10866253824 bytes (11 GB, 10 GiB) copied, 131.96 s, 82.3 MB/s

This shows the BTRFS quota reached, and causes the VM filesystem to be remounted read-only due to I/O errors.

sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
    Name:           v1
    UUID:           5415bc21-30ab-3e49-a13d-c187d2e4abcd
    Parent UUID:        8b7b65af-066a-e441-83d0-100567859844
    Received UUID:      -
    Creation time:      2021-12-01 17:57:06 +0000
    Subvolume ID:       278
    Generation:         3000
    Gen at creation:    522
    Parent ID:      5
    Top level ID:       5
    Flags:          -
    Snapshot(s):
    Quota group:        0/278
      Limit referenced: 12989759488
      Limit exclusive:  -
      Usage referenced: 12989743104
      Usage exclusive:  10804129792

Stop the VM and check that the BTRFS quota is still exceeded by trying to manually set size.state back down:

lxc stop v1
lxc config device set v1 root size.state=100MiB
Error: Failed to write backup file: Failed to create file "/var/lib/lxd/virtual-machines/v1/backup.yaml": open /var/lib/lxd/virtual-machines/v1/backup.yaml: disk quota exceeded

Now set a very large size.state and repeat the previous experiment, which now succeeds: we still fill up the root image file, but get "No space left on device" rather than "Read-only file system", because the BTRFS quota hasn't been reached.

lxc config device set v1 root size.state=10GiB
lxc start v1
lxc shell v1
root@v1:~# dd if=/dev/urandom of=/root/foo.img bs=4M conv=fdatasync
dd: error writing '/root/foo.img': No space left on device
2720+0 records in
2719+0 records out
11407343616 bytes (11 GB, 11 GiB) copied, 137.557 s, 82.9 MB/s

Now try to find out how much it would have gone over quota by reducing size.state incrementally:

lxc stop v1
lxc config device set v1 root size.state=100MiB
Error: Failed to write backup file: Failed to create file "/var/lib/lxd/virtual-machines/v1/backup.yaml": open /var/lib/lxd/virtual-machines/v1/backup.yaml: disk quota exceeded
...
lxc config device set v1 root size.state=969MiB

So about 969MiB over, compared to 1098MiB before. Not much in it though.
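
For reference, that incremental search can be scripted roughly like this (a sketch, assuming a rejected quota change makes lxc exit non-zero; the bounds and 1MiB step are arbitrary):

# Walk size.state down in 1MiB steps until BTRFS refuses the quota change;
# the last accepted value is the smallest state allowance that still fits the overshoot.
for size in $(seq 1098 -1 100); do
    if ! lxc config device set v1 root size.state=${size}MiB; then
        echo "smallest accepted size.state was $((size + 1))MiB"
        break
    fi
done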

@stgraber so it's not the async nature of the quota that matters in this case, because the disk image size itself should (and does) prevent the filesystem from growing beyond the root disk image size.

The issue is that BTRFS sees that usage as exceeding its quota by a large amount, even though its own filesystem du tool shows the disk image at the expected size. Starting to feel like a BTRFS bug.

stgraber commented 2 years ago

What kernel are you testing on?

tomponline commented 2 years ago

Focal hwe (5.11.0-41-generic)

stgraber commented 2 years ago

Ok, I guess you could run a quick test on linux-generic-hwe-20.04-edge (5.13) but that's not sounding too promising.

tomponline commented 2 years ago

Tried it with 5.13.0-22-generic and same result I'm afraid.

stgraber commented 2 years ago

@tomponline and just to be sure, we're feeding those btrfs quotas in bytes not in MB/MiB? Otherwise it could be a unit issue if say we create the img file using the exact size in bytes from a value in MB/GB but then set the quota in MiB/GiB instead.

stgraber commented 2 years ago

If that's all good, then it'd be good to send a minimal reproducer (ideally without lxd/qemu involved) to the btrfs mailing-list, reference it here and close this issue.

tomponline commented 2 years ago

I've not been able to reproduce it without QEMU involved, sadly; using a loopdev on the host doesn't exhibit the issue.

See https://github.com/lxc/lxd/issues/9124#issuecomment-983825398

I'll triple check the setup of the btrfs quotas and disk image size.

tomponline commented 2 years ago

Running lxc config device set v1 root size=11GiB on a fresh VM on a BTRFS storage pool results in this debug log output:

DBUG[12-02|17:34:59] SetInstanceQuota started                 driver=btrfs pool=btrfs project=default instance=v1 size=11GiB vm_state_size=
DBUG[12-02|17:35:00] Moved GPT alternative header to end of disk driver=btrfs pool=btrfs dev=/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
DBUG[12-02|17:35:00] Accounting for VM image file size        driver=btrfs pool=btrfs sizeBytes=11916017664
DBUG[12-02|17:35:00] SetInstanceQuota finished                driver=btrfs pool=btrfs project=default instance=v1 size=11GiB vm_state_size=

So we can see the actual size for the BTRFS quota is being calculated as 11916017664 bytes, which is 11GiB (11811160064 bytes) for the root disk file + 100MiB (104857600 bytes) for the state files = 11916017664 bytes.
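
The same calculation, for reference:

echo $((11 * 1024**3 + 100 * 1024**2))   # 11GiB root disk + 100MiB state allowance, in bytes
11916017664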

And we can see from sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1 that the "Limit Referenced" value is 11916017664, which matches what LXD was requesting.

The /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img is correctly sized to 11GiB (11811160064 bytes):

sudo du -b /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
11811160064 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

So the VM's filesystems should not be able to exceed 11811160064 bytes. When setting size.state=10GiB (to allow the root disk filesystem to fill up without reaching BTRFS' quota) and then running inside the VM:

dd if=/dev/urandom of=/root/foo.img bs=4M conv=fdatasync
dd: error writing '/root/foo.img': No space left on device
2474+0 records in
2473+0 records out
10374795264 bytes (10 GB, 9.7 GiB) copied, 126.42 s, 82.1 MB/s

Then according to sudo btrfs fi du --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img it does not exceed the BTRFS quota, nor the size of the root disk file 11811160064 bytes (as expected):

sudo btrfs fi du --raw  /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
     Total   Exclusive  Set shared  Filename
11806830592  11806830592           0  /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

But for some reason, when using QEMU, BTRFS seems to think that the quota has been exceeded, so when trying to reduce size.state back down to the default 100MiB it won't allow it, even though we know the disk image size hasn't been exceeded.

lxc config device set v1 root size.state=100MiB
Error: Failed to write backup file: Failed to create file "/var/lib/lxd/virtual-machines/v1/backup.yaml": open /var/lib/lxd/virtual-machines/v1/backup.yaml: disk quota exceeded
lxc config device set v1 root size.state=970MiB # Smallest size allowed by BTRFS

So now we can see the referenced bytes that BTRFS thinks have been used (and the new limit that only just accommodates them):

sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
    Name:           v1
    UUID:           c7408307-e584-9e45-b9eb-98d740999f89
    Parent UUID:        8b7b65af-066a-e441-83d0-100567859844
    Received UUID:      -
    Creation time:      2021-12-02 17:33:18 +0000
    Subvolume ID:       266
    Generation:         7350
    Gen at creation:    5842
    Parent ID:      5
    Top level ID:       5
    Flags:          -
    Snapshot(s):
    Quota group:        0/266
      Limit referenced: 12828278784
      Limit exclusive:  -
      Usage referenced: 12828004352
      Usage exclusive:  12828004352
tomponline commented 2 years ago

@stgraber the thing I can't figure out is why this apparently only happens when the disk file is accessed by QEMU. I've even tried getting QEMU to access the disk file via a manually set up loopdev (so passing /dev/loop21 to QEMU rather than /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img) and it had the same behaviour. So it seems to be an interplay between BTRFS quotas and something QEMU's I/O is doing that causes problems.

stgraber commented 2 years ago

@tomponline it likely has to do with the API used by QEMU; it may be related to async I/O with multiple I/O threads on the same block or something, with btrfs incorrectly accounting for those writes.

stgraber commented 2 years ago

I'm closing the LXD issue, as the one thing that's clear right now is that our quota and file size calculations are all correct; it's the enforcement which is problematic.

@tomponline can you send what you have to the btrfs mailing-list (or bug tracker if they have one) and we'll see if they come up with anything useful.

tomponline commented 2 years ago

I've asked on #btrfs IRC and if I get no reply I will post to linux-btrfs@vger.kernel.org.

tomponline commented 2 years ago

Chatting on #btrfs IRC, forza (@Forza-tng ?) says (paraphrased):

Extents are immutable, so when blocks are written to they end up in new extents, and the old extent remains until all of its data is dereferenced or rewritten. You'd need up to double the quota to be safe; you have to allow for 200% space usage. Try compress-force and autodefrag; how well autodefrag works depends on the workload. Because extents in btrfs are immutable, the worst case is when only 4k of a 128MiB extent (the max extent size) is referenced and 128MiB-4k is wasted.

mutlicore says:

I'd probably use compress or compress-force with datacow VM images to limit the extent sizes. Compression limits the max compressed extent size to 128KiB; uncompressed extents can still be 128MiB. Compress-force is the same, but it limits uncompressed extent sizes to 512KiB (currently as a side-effect of sorts).

tomponline commented 2 years ago

I was asked to run the compsize tool which also accounted for the extra usage:

sudo compsize --bytes /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     12967923712  12967923712  11807326208
none       100%     12967923712  12967923712  11807326208
tomponline commented 2 years ago

@stgraber so shall I look into the compress-force mount argument for BTRFS VM volumes and/or go for the belt-and-braces approach of using a BTRFS quota of <size.state size>+(2*<disk image size>)?

tomponline commented 2 years ago

@stgraber there's various info about compression at https://btrfs.wiki.kernel.org/index.php/Compression, but the problem I see with compress-force is that it's a mount option and so would affect all files in the storage pool. You can enable compression on a per-file basis (https://btrfs.wiki.kernel.org/index.php/Compression#Can_I_force_compression_on_a_file_without_using_the_compress_mount_option.3F), but confusingly this doesn't enable compress-force for the file; it forcefully enables compress (which uses heuristics to decide whether or not to compress).

It doesn't look like you can enable compress-force for a single file.
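
For reference, the per-file options mentioned above look roughly like this (a sketch; the path is illustrative, and neither form gives compress-force behaviour):

chattr +c /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img   # mark the file for compression (heuristics still apply)
btrfs property set /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img compression zstd   # set a specific algorithm on just this file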

stgraber commented 2 years ago

Hmm, so doubling the quota feels wrong as users will (rightfully) expect the volume to not exceed the quota they set...

The compress stuff has been absolutely horrible at causing filesystem corruption in the past and as it's indeed a global option for the entire filesystem, I'd be quite worried to turn it on.

One thing that comes to mind though is that I thought the recommendation was for VM images to be marked as nocow through a filesystem attribute. I wonder if that would improve this behaviour and what the downside would be.

tomponline commented 2 years ago

The compress stuff has been absolutely horrible at causing filesystem corruption in the past and as it's indeed a global option for the entire filesystem, I'd be quite worried to turn it on.

Isn't BTRFS wonderful :(

I thought the recommendation was for VM images to be marked as nocow

Interesting, I had no idea the nocow option existed, nor that it was the recommended option for VM images. I found this https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F and will see if that helps. It might well do given what we know now.
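
For reference, a sketch of applying the nocow attribute (the attribute only takes effect when set on an empty file, hence the ordering; the path is illustrative):

truncate -s 0 /path/to/root.img    # file must be empty before the attribute is set
chattr +C /path/to/root.img        # mark as NOCOW
lsattr /path/to/root.img           # expect the 'C' flag to show
# ...now unpack/populate the image into the file...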

tomponline commented 2 years ago

@stgraber I'm afraid the nodatacow option didn't work. I was initially encouraged, but it fell apart once snapshots of VMs created from an image were involved. I've also checked that I am applying the +C attribute correctly, because it can only be applied to empty files.

I tested this by running chattr +C on an existing file, and it didn't show as applied with lsattr (even though the chattr command didn't fail with an error, so that's another bug), whereas it did apply and show with lsattr when it was added to the empty root file before the image unpacker was run, and it was still showing as applied on the VM snapshot of the image volume.

Sadly it still managed to reach the quota, and compsize showed the same issue as before.

I wonder if we shouldn't use the 2x capacity approach, but rather than silently adding it, check when applying the quota that it allows for 2x the disk image size?

tomponline commented 2 years ago

@stgraber Good morning. I've found out why I initially thought that the nodatacow option was working and then abruptly changed my mind. The reason is that initially I was testing on a VM I had imported from a backup (so it wasn't a snapshot of an image) whereas later I was testing on a VM that was created as a snapshot from an image volume.

Further reading on the subject of nodatacow revealed this post:

https://www.spinics.net/lists/linux-btrfs/msg35491.html

Second, there's the snapshotting exception. Because a btrfs snapshot locks the existing file data in place with the snapshot, the first modification to a fileblock after a snapshot will force a COW for that block, even on an otherwise nocow file. The nocow attribute remains in effect, however, and further writes to the same block will modify it in- place... until the next snapshot of course.

So the issue is that for VMs created as a snapshot of the VM image volume, the first write to a block will necessarily cause a CoW operation, and thus the VM volume's quota usage increases because it references both the old and new extents of that block (this is why compression helps: it reduces the maximum extent size, so the issue is less pronounced).
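
A minimal sketch of that behaviour outside of LXD (assuming /mnt/btrfs is a quota-enabled btrfs mount; paths and sizes are illustrative):

btrfs quota enable /mnt/btrfs
btrfs subvolume create /mnt/btrfs/vol
touch /mnt/btrfs/vol/disk.img && chattr +C /mnt/btrfs/vol/disk.img   # NOCOW file, set while empty
dd if=/dev/zero of=/mnt/btrfs/vol/disk.img bs=1M count=512 conv=notrunc
btrfs subvolume snapshot /mnt/btrfs/vol /mnt/btrfs/vol-snap
# The first write to a block after the snapshot still forces a CoW despite NOCOW,
# and only part of the original extent gets rewritten.
dd if=/dev/urandom of=/mnt/btrfs/vol/disk.img bs=4K count=1 seek=1000 conv=notrunc
sync
btrfs qgroup show -r /mnt/btrfs   # referenced usage of 0/vol can now exceed the file size,
                                  # as the partially rewritten extent stays referenced in full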

It gets worse though.

I've noticed that the backup import system currently only restores the primary volume's quota, not the state volumes' quota. This means that for BTRFS backup imports the subvolume is restored with no quota.

Fixing this issue then causes another serious problem.

Whilst in the previous example the image volume size is set to the default 10GiB, the actual data usage is whatever the image size is (for Ubuntu Focal it's approximately 4GiB). So there is some leeway before the quota is totally filled up.

However, when exporting a VM to a non-optimized backup and then re-importing it, the full raw image file is written back to the subvolume. Combined with fixing the size.state quota restoration, this means that the disk file is effectively considered full from a BTRFS quota perspective. So if a snapshot is taken of that restored VM, any write to the VM will cause a CoW event and almost immediately reach the subvolume quota (less than 100MiB of writes need to occur before it is reached).

So we are in a tricky position:

If we fix the backup restoration issue so that the size.state quota is set correctly, then any subsequent snapshot of the restored VM will very quickly cause the source VM's disk to fail with I/O errors, as it will hit the underlying BTRFS quota. A sketch of that scenario follows.
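
In command form, the failure scenario would look roughly like this (a sketch; names and sizes are illustrative):

lxc import v1-backup.tar.gz        # full raw image written back, quota nearly consumed
lxc snapshot v1 snap0              # locks the existing extents in place
lxc start v1
lxc shell v1
dd if=/dev/urandom of=/root/small.bin bs=1M count=200   # a couple hundred MiB of CoW writes
                                                        # is enough to hit the subvolume quota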

tomponline commented 2 years ago

@stgraber this is effectively the same issue as LVM has for non-thin volume snapshots (where snapshots have to be created with a size that limits the total number of CoWs that can occur). For the LXD LVM driver, this has been addressed by creating the snapshot at the same size as the volume (effectively doubling the quota):

https://github.com/lxc/lxd/blob/master/lxd/storage/drivers/driver_lvm_utils.go#L397-L408

We effectively need to do the same and account for the BTRFS snapshot CoW, but using BTRFS semantics of doubling the quota (which is not as nice as the LVM approach, as we cannot assign that additional quota just for CoW usage).

I realise you said above:

Hmm, so doubling the quota feels wrong as users will (rightfully) expect the volume to not exceed the quota they set...

But as we already have a precedent for this in the LVM driver (i.e. if I set an LVM volume to a 10GiB size and then take a snapshot, writes to the original LVM volume can now take up to 20GiB of space due to accounting for the CoW of the snapshot), does this change your position?

stgraber commented 2 years ago

Not really as in the LVM case it still won't allow me to dump 10GiB of state data on top of the quota.

In the btrfs case, if we use the approach of silently doubling the quotas and we have users who happen to start using stateful stop/snapshot, they will be allowed to exceed their quotas, potentially by tens of GiB, completely messing up any chance of doing proper resource control on the system (thinking of shared environments with restricted projects combined with project limits).

tomponline commented 2 years ago

Not really as in the LVM case it still won't allow me to dump 10GiB of state data on top of the quota.

This is because we have conflated the state and disk file quotas by not using a separate subvolume (without quota) for the disk image file, whereas with LVM there are separate LVs for state and root disk data.

In theory using a single subvolume made sense, but given how BTRFS does CoW accounting for quotas, using a separate subvolume, although a lot larger change, seems like the best approach to address this cleanly.

stgraber commented 2 years ago

It would still be papering over an upstream issue. Yes, doing two volumes would help a bit for the block case, but we'd still get that failure on the fs volume as that one would still need a limit and so would hit the bug if ever snapshotted.

Similarly, we could absolutely reproduce this issue with a container filesystem.

I'm usually not very keen on papering over other people's bugs especially if we can't take care of the entire issue in a consistent way.

I still think the best we can do here is document the btrfs issue and let people decide what they want to do. For most I'm hoping it will be staying away from btrfs while those who really want btrfs should probably consider compression.

tomponline commented 2 years ago

OK, I will absolutely update the docs.