Could it be related to this? https://github.com/lxc/lxd/issues/8468
It's possible, yeah. btrfs quotas are really frustrating to work with and very odd compared to what you get on ZFS or through project quotas on ext4/xfs. For one thing, they seem somewhat asynchronous, making it possible to exceed the limit by a few hundred MB before the quota kicks in. This may well be the source of the issue here, but we'll need to investigate some more.
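For context, a minimal sketch of how a referenced qgroup limit like the one LXD applies can be set and inspected by hand (the pool path and limit value are illustrative):

```sh
# Enable quota tracking on the btrfs filesystem backing the pool.
sudo btrfs quota enable /var/lib/lxd/storage-pools/btrfs

# Set a "referenced" limit (in bytes) on the subvolume's qgroup.
sudo btrfs qgroup limit 11916017664 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1

# Inspect usage against both referenced and exclusive limits.
sudo btrfs qgroup show -re /var/lib/lxd/storage-pools/btrfs
```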
Yes, I noticed that when I looked into it initially: the quota consumption kept changing even though the actual instance wasn't running.
Issue #8468 is about incorrect disk usage: as seen in the examples, it reported the usage as 9MB despite the actual usage being 500MB. If I remember correctly, creating a snapshot resets the usage.
It was another issue where I reported the delayed reporting of usage information.
So I am thinking that the usage calculation is somehow computing the difference between the filesystem and the most recent snapshot, as opposed to the original image.
It's not even that simple, because if you wait some time you'll see the utilisation change on the original volume too, as BTRFS performs an async usage scan after taking the snapshot.
I think I submitted that as a different issue, but my understanding was that this is just how the storage driver works, so we have to deal with it. It's mainly noticeable to users when using a web UI for LXD; using the command line, you probably won't notice it.
Yes, it's not ideal if the BTRFS reporting tools are async. But if it allows users to exceed their quotas due to that async nature, then it's pretty nasty. And this is without even using snapshots (apart from the initial source one, that is).
There's also the concept of a referenced data limit vs an exclusive data limit, whose ramifications I've not fully understood yet.
I think we are talking about two different problems. The problem I discovered was that creating a snapshot on a custom BTRFS partition resets the reported usage to almost nothing. This means the reported usage is way lower than the actual usage, so I am not sure if that later leads to the problem reported above.
In this case I am talking about the example above.
No problem, I just saw the issue come through and it reminded me of similar problems, which could create bugs if the usage information is used by the API for something else.
@tomponline assigning to you so we can decide what to do with this.
If the issue is that btrfs applies quotas asynchronously, then I suspect the only thing we can do is mention it in doc/storage.md and close the issue. For those affected, setting the `size.state` property on the root device to something quite large like 1GiB should do the trick to let you start things back up, but it's not something that I think we should be doing for the users.
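A sketch of that workaround, assuming an instance named v1:

```sh
# Give the volume extra headroom for state data so the VM can start again.
lxc config device set v1 root size.state=1GiB
lxc start v1
```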
@stgraber looking into this now.
It's been a while since I encountered this, but let me know if I can help.
To rule out any issues with snapshots and quota accounting, I created a VM, exported it as a non-optimised backup and then re-imported it, so it wasn't linked to any existing image subvolume.
After import I set the disk size to 12GiB and then filled it up as normal.
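For reference, a minimal sketch of that setup (the instance name and backup filename are illustrative):

```sh
# Export as a non-optimised backup (the default), then re-import it so the
# new VM has no link to the original image subvolume.
lxc export v1 v1.tar.gz
lxc delete v1
lxc import v1.tar.gz

# Grow the root disk, then fill it from inside the guest.
lxc config device set v1 root size=12GiB
```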
```
sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
        Name:                   v1
        UUID:                   35c8e59c-e13e-b049-959f-66746bb10a4f
        Parent UUID:            -
        Received UUID:          -
        Creation time:          2021-12-01 14:34:18 +0000
        Subvolume ID:           307
        Generation:             28884
        Gen at creation:        6547
        Parent ID:              5
        Top level ID:           5
        Flags:                  -
        Snapshot(s):
        Quota group:            0/307
          Limit referenced:     13409189888
          Limit exclusive:      -
          Usage referenced:     13409112064
          Usage exclusive:      13409112064
```
I checked using `btrfs fi du` and `du -B1` and they agree on the size of the volume's root disk file:
```
sudo du -B1 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
12101988352     /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

sudo btrfs fi du --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
      Total   Exclusive  Set shared  Filename
12101988352  12101988352           0  /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
```
So it seems the quota group is reporting incorrect info, as even the total for the subvolume is larger than what `btrfs fi du` reports for the whole subvolume.
Here's something curious: if you create an empty VM, then manually set up a loop device, create a filesystem, mount it and fill up that filesystem, it doesn't exceed the disk image file's size and doesn't reach the BTRFS quota.
```
lxc init v1 --empty --vm -s btrfs
lxc config device set v1 root size=11GiB
losetup -f /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img --show
/dev/loop21
mkfs.ext4 /dev/loop21
mount /dev/loop21 /mnt
dd if=/dev/random of=/mnt/foo
dd: writing to '/mnt/foo': No space left on device
22460761+0 records in
22460760+0 records out
11499909120 bytes (11 GB, 11 GiB) copied, 189.115 s, 60.8 MB/s
```
```
sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
        Name:                   v1
        UUID:                   44d51cde-aa4e-874c-8bcc-344d454ec955
        Parent UUID:            -
        Received UUID:          -
        Creation time:          2021-12-01 16:24:35 +0000
        Subvolume ID:           373
        Generation:             35468
        Gen at creation:        35439
        Parent ID:              5
        Top level ID:           5
        Flags:                  -
        Snapshot(s):
        Quota group:            0/373
          Limit referenced:     11916017664
          Limit exclusive:      -
          Usage referenced:     11572088832
          Usage exclusive:      11572088832
```
And yet doing the same `dd` command inside the VM will fill up the BTRFS quota.

I even tried using `losetup` to create a loop device for the disk image and then manually modified LXD to use the /dev/loop21 device as its root disk rather than using the root file directly, with the same effect: the BTRFS quota was exceeded.
It looks like some issue in the interplay between QEMU and BTRFS quotas.
I think issue #8468 also covers how things changed when creating snapshots as well.
@tomponline possibly has to do with the size of the writes. You could probably try `dd` with something like `bs=4M conv=fdatasync`? I wonder if btrfs re-computes the quota on sync and not on write or something.
@stgraber I've also tried this with a BTRFS storage pool on a raw NVMe device rather than on a loopdev, to avoid any issues with loopdev on loopdev, but the same thing happens.
It seems that I need to set `size.state` to 1098MiB to allow a 12GiB disk image to be filled without also causing the BTRFS quota to be reached.
> @tomponline possibly has to do with the size of the writes. You could probably try `dd` with something like `bs=4M conv=fdatasync`? I wonder if btrfs re-computes the quota on sync and not on write or something.
I did start to think perhaps it's a fragmentation issue, which could also be affected by block size. This would explain why the quota's extent accounting is larger than the actual file sizes BTRFS tracks.
Same thing, but slightly different: using a larger block size still ends up exceeding the quota, but not by as much.

First, try to fill a 12GiB disk:
```
root@v1:~# dd if=/dev/urandom of=/root/foo.img bs=4M conv=fdatasync
dd: error writing '/root/foo.img': Read-only file system
2591+0 records in
2590+0 records out
10866253824 bytes (11 GB, 10 GiB) copied, 131.96 s, 82.3 MB/s
```
This shows the BTRFS quota being reached, which causes the VM's filesystem to be remounted read-only due to I/O errors.
```
sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
        Name:                   v1
        UUID:                   5415bc21-30ab-3e49-a13d-c187d2e4abcd
        Parent UUID:            8b7b65af-066a-e441-83d0-100567859844
        Received UUID:          -
        Creation time:          2021-12-01 17:57:06 +0000
        Subvolume ID:           278
        Generation:             3000
        Gen at creation:        522
        Parent ID:              5
        Top level ID:           5
        Flags:                  -
        Snapshot(s):
        Quota group:            0/278
          Limit referenced:     12989759488
          Limit exclusive:      -
          Usage referenced:     12989743104
          Usage exclusive:      10804129792
```
Stop the VM and confirm the BTRFS quota is still exceeded by manually setting `size.state`:
```
lxc stop v1
lxc config device set v1 root size.state=100MiB
Error: Failed to write backup file: Failed to create file "/var/lib/lxd/virtual-machines/v1/backup.yaml": open /var/lib/lxd/virtual-machines/v1/backup.yaml: disk quota exceeded
```
Now set a very large `size.state` and repeat the previous experiment, which now succeeds in that we still fill up the root image file, but we get "No space left on device" rather than "Read-only file system" because the BTRFS quota hasn't been reached.
```
lxc config device set v1 root size.state=10GiB
lxc start v1
lxc shell v1
root@v1:~# dd if=/dev/urandom of=/root/foo.img bs=4M conv=fdatasync
dd: error writing '/root/foo.img': No space left on device
2720+0 records in
2719+0 records out
11407343616 bytes (11 GB, 11 GiB) copied, 137.557 s, 82.9 MB/s
```
Now try to find how much it would have gone over quota by reducing `size.state` incrementally:
```
lxc stop v1
lxc config device set v1 root size.state=100MiB
Error: Failed to write backup file: Failed to create file "/var/lib/lxd/virtual-machines/v1/backup.yaml": open /var/lib/lxd/virtual-machines/v1/backup.yaml: disk quota exceeded
...
lxc config device set v1 root size.state=969MiB
```
So about 969MiB over in this case, compared to 1098MiB before. Not much in it though.
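A rough sketch of automating that search against the same v1 instance (the step size and bounds are arbitrary):

```sh
# Walk size.state down in 16MiB steps; the last accepted value is roughly
# how far the subvolume is over its base quota.
for size in $(seq 1100 -16 100); do
    if lxc config device set v1 root size.state="${size}MiB" 2>/dev/null; then
        echo "accepted: ${size}MiB"
    else
        echo "rejected: ${size}MiB"
        break
    fi
done
```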
@stgraber so it's not the async nature of the quota that matters in this case, because the disk image size itself should (and does) prevent the filesystem from growing beyond the root disk image size.
The issue is that BTRFS sees that usage as exceeding its quota by a large amount, even though its own filesystem du tool (`btrfs fi du`) shows the disk image at the expected size. Starting to feel like a BTRFS bug.
What kernel are you testing on?
Focal hwe (5.11.0-41-generic)
Ok, I guess you could run a quick test on linux-generic-hwe-20.04-edge (5.13) but that's not sounding too promising.
Tried it with `5.13.0-22-generic` and same result, I'm afraid.
@tomponline and just to be sure, we're feeding those btrfs quotas in bytes, not in MB/MiB? Otherwise it could be a unit issue if, say, we create the img file using the exact size in bytes from a value in MB/GB but then set the quota in MiB/GiB instead.
If that's all good, then it'd be good to send a minimal reproducer (ideally without lxd/qemu involved) to the btrfs mailing-list, reference it here and close this issue.
I've not been able to reproduce it without QEMU involved, sadly; using a loopdev on the host doesn't exhibit the issue.
See https://github.com/lxc/lxd/issues/9124#issuecomment-983825398
I'll triple check the setup of the btrfs quotas and disk image size.
Running `lxc config device set v1 root size=11GiB` on a fresh VM on a BTRFS storage pool results in this debug log output:
```
DBUG[12-02|17:34:59] SetInstanceQuota started driver=btrfs pool=btrfs project=default instance=v1 size=11GiB vm_state_size=
DBUG[12-02|17:35:00] Moved GPT alternative header to end of disk driver=btrfs pool=btrfs dev=/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
DBUG[12-02|17:35:00] Accounting for VM image file size driver=btrfs pool=btrfs sizeBytes=11916017664
DBUG[12-02|17:35:00] SetInstanceQuota finished driver=btrfs pool=btrfs project=default instance=v1 size=11GiB vm_state_size=
```
So we can see the actual size for the BTRFS quota is being calculated as 11916017664 bytes, which is 11GiB (11811160064 bytes) for the root disk file + 100MiB (104857600 bytes) for the state files = 11916017664 bytes.
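That arithmetic checks out directly (a trivial sketch):

```sh
# 11GiB root disk + 100MiB default state allowance, in bytes.
echo $(( 11 * 1024**3 + 100 * 1024**2 ))
# 11916017664
```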
And we can see from `sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1` that the "Limit referenced" value is 11916017664, which matches what LXD was requesting.
The `/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img` is correctly sized to 11GiB (11811160064 bytes):

```
sudo du -b /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
11811160064     /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
```
So the VM's filesystems should not be able to exceed 11811160064 bytes, and when setting `size.state=10GiB` (to allow the root disk filesystem to fill up without reaching BTRFS' quota) and then running inside the VM:

```
dd if=/dev/urandom of=/root/foo.img bs=4M conv=fdatasync
dd: error writing '/root/foo.img': No space left on device
2474+0 records in
2473+0 records out
10374795264 bytes (10 GB, 9.7 GiB) copied, 126.42 s, 82.1 MB/s
```
Then according to `sudo btrfs fi du --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img` it does not exceed the BTRFS quota, nor the size of the root disk file of 11811160064 bytes (as expected):

```
sudo btrfs fi du --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
      Total   Exclusive  Set shared  Filename
11806830592  11806830592           0  /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
```
But for some reason, when using QEMU, BTRFS seems to think that the quota has been exceeded, so when trying to reduce `size.state` back down to the default 100MiB it won't allow it, even though we know the disk image size hasn't been exceeded.

```
lxc config device set v1 root size.state=100MiB
Error: Failed to write backup file: Failed to create file "/var/lib/lxd/virtual-machines/v1/backup.yaml": open /var/lib/lxd/virtual-machines/v1/backup.yaml: disk quota exceeded
lxc config device set v1 root size.state=970MiB # Smallest size allowed by BTRFS
```
So now the "Limit referenced" value reflects the number of bytes BTRFS thinks have been used:
```
sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
virtual-machines/v1
        Name:                   v1
        UUID:                   c7408307-e584-9e45-b9eb-98d740999f89
        Parent UUID:            8b7b65af-066a-e441-83d0-100567859844
        Received UUID:          -
        Creation time:          2021-12-02 17:33:18 +0000
        Subvolume ID:           266
        Generation:             7350
        Gen at creation:        5842
        Parent ID:              5
        Top level ID:           5
        Flags:                  -
        Snapshot(s):
        Quota group:            0/266
          Limit referenced:     12828278784
          Limit exclusive:      -
          Usage referenced:     12828004352
          Usage exclusive:      12828004352
```

Note the new limit: `Limit referenced: 12828278784`.
@stgraber the thing I can't figure out is why this apparently only happens when accessing the disk file via QEMU. I've even tried getting QEMU to access the disk file via a manually set up loopdev (passing `/dev/loop21` to QEMU rather than `/var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img`) and it had the same behaviour. So it seems to be an interplay between BTRFS quotas and something QEMU's I/O is doing that is causing problems.
@tomponline it likely has to do with the I/O API used by QEMU; it may be related to async I/O with multiple I/O threads on the same block or something, and btrfs incorrectly accounting for those writes.
I'm closing the LXD issue, as if there's one thing that's clear right now, it's that our quota and file size calculations are all correct; it's the enforcement which is problematic.
@tomponline can you send what you have to the btrfs mailing-list (or bug tracker if they have one) and we'll see if they come up with anything useful.
I've asked on #btrfs IRC and if I get no reply I will post to linux-btrfs@vger.kernel.org.
Chatting on #btrfs IRC, `forza` (@Forza-tng?) says (paraphrased):

> Extents are immutable, so when blocks are written to they end up in new extents and the old extent remains until all of its data is dereferenced or rewritten. You'd need up to double the quota to be safe; you have to allow for 200% space usage. Try compress-force and autodefrag, though how well autodefrag works depends on the workload. Because extents in btrfs are immutable, the worst case is when only 4KiB of a 128MiB extent (the max extent size) is referenced and 128MiB-4KiB is wasted.
`mutlicore` says:

> I'd probably use compress or compress-force with datacow VM images to limit the extent sizes. Compression limits the max compressed extent size to 128KiB, while uncompressed extents can still be 128MiB. Compress-force is the same, but it limits uncompressed extent sizes to 512KiB (currently, as a side-effect of sorts).
I was asked to run the `compsize` tool, which also accounts for the extra usage:

```
sudo compsize --bytes /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL      100%     12967923712  12967923712  11807326208
none       100%     12967923712  12967923712  11807326208
```
@stgraber so shall I look into the `compress-force` mount argument for BTRFS VM volumes and/or go for the belt-and-braces approach of using a BTRFS quota of `<size.state size> + (2 * <disk image size>)`?
@stgraber there's various info about compression on https://btrfs.wiki.kernel.org/index.php/Compression, but the problem I see with `compress-force` is that it's a mount option and so would affect all files in the storage pool. You can enable compression on a per-file basis (https://btrfs.wiki.kernel.org/index.php/Compression#Can_I_force_compression_on_a_file_without_using_the_compress_mount_option.3F), but confusingly this doesn't enable `compress-force` for a file; it forcefully enables `compress` (which uses heuristics to decide whether or not to compress).

It doesn't look like you can enable `compress-force` for a single file.
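For reference, a sketch of the per-file mechanisms mentioned above (the image path is illustrative, and as noted this gives compress, not compress-force, behaviour):

```sh
# Per-file compression via a btrfs property; only newly written data is
# affected, existing extents stay as they are.
sudo btrfs property set /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img compression zstd

# Or via the inode attribute interface.
sudo chattr +c /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
```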
Hmm, so doubling the quota feels wrong as users will (rightfully) expect the volume to not exceed the quota they set...
The compress stuff has been absolutely horrible at causing filesystem corruption in the past and as it's indeed a global option for the entire filesystem, I'd be quite worried to turn it on.
One thing that comes to mind though is that I thought the recommendation was for VM images to be marked as nocow through a filesystem attribute. I wonder if that would improve this behaviour and what the downside would be.
> The compress stuff has been absolutely horrible at causing filesystem corruption in the past and as it's indeed a global option for the entire filesystem, I'd be quite worried to turn it on.
Isn't BTRFS wonderful :(
> I thought the recommendation was for VM images to be marked as nocow

Interesting, I had no idea the `nocow` option existed, nor that it was the recommended option for VM images.
I found this https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F and will see if that helps. It might well do given what we know now.
@stgraber I'm afraid the `nodatacow` option didn't work. I was initially encouraged, but it stopped working once I used snapshots of VMs created from an image. I've also checked that I am applying the `+C` attribute correctly, because it can only be applied to empty files.
I tested this by running `chattr +C` on an existing file and it didn't show as applied with `lsattr` (even though the chattr command didn't fail with an error; another bug, then), whereas it did apply and show with `lsattr` when it was added to the empty root file before the image unpacker was run, and it was still showing as applied on the VM snapshot of the image volume.
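A sketch of that check, assuming a freshly created image file:

```sh
truncate -s 0 root.img   # +C (nocow) only takes effect on empty files
chattr +C root.img
lsattr root.img          # expect a 'C' among the attribute flags
```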
Sadly it still managed to reach the quota, and compsize showed the same issue as before.
I wonder if we shouldn't use the 2x capacity approach, but rather than silently add it, check when applying the quota that it allows for 2x the disk image size?
@stgraber Good morning. I've found out why I initially thought that the `nodatacow` option was working and then abruptly changed my mind. The reason is that initially I was testing on a VM I had imported from a backup (so it wasn't a snapshot of an image), whereas later I was testing on a VM that was created as a snapshot from an image volume.
Further reading on the subject of `nodatacow` revealed this post: https://www.spinics.net/lists/linux-btrfs/msg35491.html
> Second, there's the snapshotting exception. Because a btrfs snapshot locks the existing file data in place with the snapshot, the first modification to a file block after a snapshot will force a COW for that block, even on an otherwise nocow file. The nocow attribute remains in effect, however, and further writes to the same block will modify it in-place... until the next snapshot of course.
So the issue is that for VMs created as a snapshot of the VM image volume, the first write to a block will necessarily cause a CoW operation, and thus the VM volume's quota usage increases because it references both the old and new extents of that block (this is why compression helps: it reduces the maximum extent size, so the issue is less exacerbated).
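A minimal sketch of that amplification on a plain btrfs mount, assuming /mnt/btrfs and illustrative sizes (whether and how far the referenced usage overshoots will depend on kernel version and extent layout):

```sh
# Write a file with large extents, snapshot the subvolume, then rewrite
# scattered 4KiB blocks. Each rewrite allocates a new extent while the
# snapshot pins the old one, so referenced usage grows past the file size.
sudo btrfs subvolume create /mnt/btrfs/vol
sudo btrfs quota enable /mnt/btrfs
sudo dd if=/dev/urandom of=/mnt/btrfs/vol/disk.img bs=1M count=256 conv=fdatasync
sudo btrfs subvolume snapshot /mnt/btrfs/vol /mnt/btrfs/vol-snap

for off in $(seq 0 64 65535); do   # every 64th 4KiB block
    sudo dd if=/dev/urandom of=/mnt/btrfs/vol/disk.img bs=4K count=1 \
        seek="$off" conv=notrunc status=none
done
sync

sudo btrfs qgroup show -re /mnt/btrfs
sudo compsize /mnt/btrfs/vol/disk.img
```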
It gets worse though.
I've noticed that the backup import system currently only restores the primary volume's quota, not the state volumes' quota. This means that for BTRFS backup imports the subvolume is restored with no quota.
Fixing this issue then causes another serious problem.
Whilst in the previous example the image volume size is set to the default 10GiB, the actual data usage is whatever the image size is (for Ubuntu Focal, approximately 4GiB). So there is some leeway before the quota is totally filled up.
However, when exporting a VM to a non-optimized backup and then re-importing it, the full raw image file is written back to the subvolume. Combined with the fixed `size.state` quota restoration, this means the disk file is effectively considered full from a BTRFS quota perspective. So if a snapshot is taken of that restored VM, any write to that VM will cause a CoW event and almost immediately reach the subvolume quota (less than 100MiB of writes need to occur before it is reached).
So we are in a tricky position:
If we fix the backup restoration issue so that the `size.state` quota is set correctly, then any subsequent snapshot of the restored VM will very quickly cause the source VM's disk to fail with I/O errors, as it will hit the underlying BTRFS quota.
@stgraber this is effectively the same issue as LVM has for non-thin volume snapshots (where snapshots have to be created with a size that limits the total number of CoW operations that can occur). For the LXD LVM driver, this has been addressed by creating the snapshot at the same size as the volume (effectively doubling the quota):
https://github.com/lxc/lxd/blob/master/lxd/storage/drivers/driver_lvm_utils.go#L397-L408
We effectively need to do the same and account for the BTRFS snapshot CoW, but with BTRFS semantics that means doubling the quota (which is not as nice as the LVM approach, as we cannot assign that additional quota just for CoW usage).
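A sketch of what that accounting would look like for the 11GiB disk and default 100MiB state size used earlier (values are illustrative):

```sh
# Quota sized to absorb a full CoW rewrite of the disk after a snapshot:
# state allowance + twice the disk image size.
echo $(( 100 * 1024**2 + 2 * 11 * 1024**3 ))
# 23727177728
```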
I realise you said above:
> Hmm, so doubling the quota feels wrong as users will (rightfully) expect the volume to not exceed the quota they set...
But as we already have a precedent for this in the LVM driver (i.e. if I set an LVM volume to 10GiB and then take a snapshot, writes to the original LVM volume can now take up to 20GiB of space due to accounting for CoW of the snapshot), does this change your position?
Not really as in the LVM case it still won't allow me to dump 10GiB of state data on top of the quota.
In the btrfs case, if we use the approach of silently doubling the quotas and we have users who happen to start using stateful stop/snapshot, they will be allowed to exceed their quotas, potentially by tens of GiB, completely messing up any chance of doing proper resource control on the system (thinking of shared environments with restricted projects combined with project limits).
> Not really as in the LVM case it still won't allow me to dump 10GiB of state data on top of the quota.
This is because we have conflated the state and disk file quotas by not using a separate subvolume (without quota) for the disk image file, whereas with LVM there are separate LVs for state and root disk data.
In theory using a single subvolume made sense, but given how BTRFS does CoW accounting for quotas, using a separate subvolume, although a lot larger change, seems like the best approach to address this cleanly.
It would still be papering over an upstream issue. Yes, using two volumes would help a bit for the block case, but we'd still get that failure on the fs volume, as that one would still need a limit and so would hit the bug if ever snapshotted.
Similarly, we could absolutely reproduce this issue with a container filesystem.
I'm usually not very keen on papering over other people's bugs especially if we can't take care of the entire issue in a consistent way.
I still think the best we can do here is document the btrfs issue and let people decide what they want to do. For most, I'm hoping it will be staying away from btrfs, while those who really want btrfs should probably consider compression.
OK, I will absolutely update the docs.
Even though LXD sets the BTRFS quota 100MiB (by default) larger than the VM disk file's maximum size, if the VM uses all of its disk space the underlying BTRFS filesystem sees the referenced disk quota as reached and prevents LXD from starting the VM, because it cannot write the backup file, even though there should be some space free.
It seems like BTRFS quota isn't working the way we think it does.
See https://discuss.linuxcontainers.org/t/btrfs-issues-storage-pools-btrfs-empty-and-btrfs-quota-100-while-inside-the-vm-only-48-utilized/11897
Steps to reproduce (a sketch of the commands follows the list):

1. Check the BTRFS quota is set (expect it to be blockSize 11000004608 + 100MiB (104857600 bytes) = 11104862208 bytes).
2. Check the size of the root disk file (expect it to be 11000004608 bytes).
3. Start the VM.
4. Inside v1, run dd until the disk fills up (this should fill up the disk image but not reach the BTRFS quota, as it has another 100MiB allowed).
5. Observe that the BTRFS referenced quota has been reached, which it shouldn't have been.
6. Observe that the disk image is still at its set size of 11000004608 bytes.
7. Check the actual used blocks of the image: the total size of the volume is indeed less than the quota.
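A sketch of those steps, reconstructed from the commands used earlier in this thread (the v1 instance name, pool path and dd invocation are assumptions):

```sh
# 1-2: check the quota and the root disk file size on the host.
sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
sudo du -B1 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img

# 3-4: start the VM and fill its disk from inside the guest.
lxc start v1
lxc shell v1
dd if=/dev/urandom of=/root/foo.img bs=4M conv=fdatasync   # run inside v1

# 5-7: back on the host, compare quota usage, image size and used blocks.
sudo btrfs subvolume show --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1
sudo du -B1 /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
sudo btrfs fi du --raw /var/lib/lxd/storage-pools/btrfs/virtual-machines/v1/root.img
```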