btrfs / btrfs-todo

An issues-only repo to organize our TODO items

make the last 1% of unallocated space, mixed-bg type #10

Open cmurf opened 3 years ago

cmurf commented 3 years ago

Instead of btrfs-maintenance scripts, which aren't included in upstream btrfs-progs anyway, and various 'user as babysitter' methods of manual balancing acts, a large enough mixed-bg block group would be able to dynamically handle a mismatch between the on-disk metadata/data block group allocation ratio and a workload that suddenly has a different ratio (which otherwise results in premature ENOSPC for one bg type or the other).

I don't know if it should be a mkfs time allocation and then pinned (unused) until needed? Or if the kernel should just start creating mixed-bg's for the last 1% of unallocated space? Or if that percentage is its own bad joke? Or if this idea means I'm on extra special crazy pills and this should just be closed without comment? But at least I get an April 1 prize.

jeffmahoney commented 3 years ago

Using mixed groups for emergency allocations is something I’ve kicked around for a while but I had in mind dynamically converting an existing block group to mixed. One hiccup for both your idea and mine is that mixed groups and segregated groups aren’t allowed on the same file system on any release.

cmurf commented 3 years ago

Maybe "convert" only the unused portion of a block group? i.e. split some block groups, favoring the least-full ones. Don't move any extents; just turn the unused portion of a block group, momentarily, into unallocated space. Then the normal allocator can create a block group of the needed type in that unallocated space, avoiding mixed groups altogether.

Aged over a long time, this may end up looking something like mixed-bg, although possibly with thousands of small chunks in the chunk tree. I'm not sure whether that's a bad thing or not.

jeffmahoney commented 3 years ago

Having a ton of tiny segregated chunks isn’t a recipe for happiness. I think being able to convert entire chunks to (and from) mixed covers most of the use case. It gets tricky when it comes to management. We’d do this conversion to mixed on the fly automatically but it would be a pretty poor UX to then expect the user to clean up the mess afterward.

kdave commented 3 years ago

I'm opposed to introducing mixed bgs on a filesystem that wasn't originally created with them. This feels like trading one set of corner-case problems for another, as-yet-unexplored set, which would need special casing in the ENOSPC behaviour and the allocator. That the data:metadata usage ratio can change is known, and could be prevented by better preallocation of each bg type. Another way is to make the block groups more compact in a way that does not require the unused workspace that current relocation does.

kdave commented 3 years ago

Instead of btrfs-maintenance scripts, which aren't included in upstream btrfs-progs anyway, and various 'user as baby sitter' methods of manual balancing acts

The maintenance scripts are intentionally a separate project because they solve a different task than the progs: system services vs. basic tool support. There's a different release schedule, and the implied need to restart the services when the progs are updated. If you have concerns or questions regarding the projects, please open an issue in one of them.

cmurf commented 2 years ago

I've encountered a downstream bug: at 74% used, dnf reports "no space left on device" with a 5.14-series kernel on a single device, with allocation that looks like this:

# sudo btrfs filesystem usage /
Overall:
    Device size:         215.27GiB
    Device allocated:        215.27GiB
    Device unallocated:        1.00MiB
    Device missing:          0.00B
    Used:            158.25GiB
    Free (estimated):         56.26GiB  (min: 56.26GiB)
    Free (statfs, df):        56.26GiB
    Data ratio:               1.00
    Metadata ratio:           2.00
    Global reserve:      374.73MiB  (used: 752.00KiB)
    Multiple profiles:              no

Data,single: Size:209.24GiB, Used:152.98GiB (73.11%)
   /dev/mapper/luks-e5fbe4ab-0ae9-4428-87c0-5c98b5acadd1     209.24GiB

Metadata,DUP: Size:3.00GiB, Used:2.64GiB (87.75%)
   /dev/mapper/luks-e5fbe4ab-0ae9-4428-87c0-5c98b5acadd1       6.01GiB

System,DUP: Size:8.00MiB, Used:48.00KiB (0.59%)
   /dev/mapper/luks-e5fbe4ab-0ae9-4428-87c0-5c98b5acadd1      16.00MiB

Unallocated:
   /dev/mapper/luks-e5fbe4ab-0ae9-4428-87c0-5c98b5acadd1       1.00MiB

df reports it 74% used. And yet we run out of space, even when doing a filtered balance, e.g. -dlimit=5.

I think it's pretty clear the file system was used with a fairly low-metadata workload until all space was allocated; much of that data was then deleted, but not in a way that freed any data bgs, and then the filesystem filled again with a much more metadata-heavy workload.

I wonder if it'd make sense to have a simple scaling function that makes metadata bg allocation more aggressive on young file systems (scaled by unallocated/allocated), e.g. try to keep the unused space in metadata bgs at 50%. Later on this could taper off, to avoid over-allocating metadata bgs and running into premature data ENOSPC. Still, premature data ENOSPC is probably less bad, in that it's less wasteful, than the premature metadata ENOSPC case, which as in this example leaves the file system unusable at merely 74% used. In this particular case, even an automatic attempt at converting a data bg to a metadata bg may have failed for the same reason the balance with -dlimit=1 fails.
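The scaling-function idea above could be sketched roughly like this. This is a hypothetical heuristic in Python, not kernel code; the 50% young-filesystem target comes from the comment, while the 5% floor is invented for illustration:

```python
def metadata_target_unused_fraction(unallocated, allocated):
    """Hypothetical heuristic: keep metadata bgs ~50% unused while the
    filesystem is young (lots of unallocated space), tapering toward a
    small floor as unallocated space runs out."""
    total = unallocated + allocated
    if total == 0:
        return 0.5
    young = unallocated / total  # 1.0 = freshly made fs, 0.0 = fully allocated
    floor = 0.05                 # always keep a small metadata cushion
    return floor + (0.5 - floor) * young

def should_allocate_metadata_chunk(meta_size, meta_used, unallocated, allocated):
    """Allocate a new metadata chunk when unused metadata space drops
    below the scaled target."""
    unused = meta_size - meta_used
    target = metadata_target_unused_fraction(unallocated, allocated) * meta_size
    return unused < target
```

On a fresh filesystem the target is 50% unused metadata space; on a fully allocated one it falls to the 5% floor, which is why a heuristic like this would still not have rescued the filesystem above once unallocated space hit 1 MiB.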

josefbacik commented 2 years ago

This is going to be addressed by @boryas soon; this style of failure is becoming a problem internally as well. We've done too good of a job of not allocating metadata chunks. It's time to change how much free metadata space we keep around, to stop this style of problem from happening.

Zygo commented 2 years ago

it's pretty clear the file system was used with a fairly low metadata workload until all space was used

This can also happen if the user runs a "maintenance" balance on metadata block groups. A metadata balance on a non-full filesystem pretty much guarantees ENOSPC at some point when the filesystem fills up, possibly months in the future. "Never balance metadata" is the one-sentence oversimplified workaround: eventually enough metadata block groups accumulate to reserve enough space that we stop running out, and then we only need to make sure we never delete them (as a metadata balance does).

In practice there needs to be something like 2 + number_of_devices block groups' worth of free metadata space all the time to handle all the special cases (e.g. when a block group becomes read-only during a scrub or balance, you suddenly have less usable metadata space, and hit ENOSPC when trying to get more). It leads to fun corner cases like "can't replace a small failed disk in raid1 because only one metadata BG has free space, and replace locked it."
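The reserve described above works out numerically as follows. A sketch; the 256 MiB metadata chunk size is an assumption for illustration (actual chunk sizes vary with filesystem size):

```python
def min_free_metadata_bytes(num_devices, bg_size=256 * 1024**2):
    """Rule of thumb from this thread: keep about (2 + number_of_devices)
    metadata block groups' worth of free metadata space at all times, so
    that a block group going read-only during scrub/balance/replace
    doesn't cause ENOSPC."""
    return (2 + num_devices) * bg_size
```

For a single-device filesystem with 256 MiB metadata chunks, that is 3 x 256 MiB = 768 MiB of free metadata space.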

kakra commented 2 years ago

Maybe metadata balance should allow for a target, e.g. "balance all metadata chunks to 80% usage"; this could reverse the bad effects of previous balances... Of course, that doesn't help if it's already too late.

Zygo commented 2 years ago

The percentage would depend on block group count and filesystem size. For most filesystem sizes 99% is fine, but on filesystems below 1TB or so, you have to do math to figure out what the percentage should be. On smaller filesystems it ends up being under 0%, i.e. you have to have empty block groups lying around.
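The size dependence can be made concrete. A sketch, assuming the same hypothetical 256 MiB metadata chunk size and the (2 + number_of_devices) reserve mentioned earlier in the thread:

```python
def target_metadata_usage_pct(total_metadata_bytes, num_devices=1,
                              bg_size=256 * 1024**2):
    """Percentage to balance metadata block groups to while still
    leaving (2 + num_devices) block groups' worth of free metadata
    space. Goes negative on small filesystems, i.e. you would need
    empty block groups lying around."""
    reserve = (2 + num_devices) * bg_size
    return 100.0 * (1 - reserve / total_metadata_bytes)
```

With 100 GiB of allocated metadata this gives about 99.25%, matching "for most filesystem sizes 99% is fine"; with only 512 MiB of metadata it gives -50%, the "under 0%" case.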

kakra commented 2 years ago

The percentage could be made a balance parameter until someone figures out a nice formula for this. I didn't mean to suggest 80% as a static value; rather, I wanted to suggest a different balance mode that targets some percentage, instead of using it as a threshold and filling everything below it to 100%.

Zygo commented 2 years ago

you have to have empty block groups lying around.

OK, that's a confusing way to say it: there's a minimum number of block groups, regardless of size. Something like btrfs-balance-least-used could walk the metadata block group items, figure out whether there's enough free space in the right number of block groups, do the right amount of balancing, and then stop.
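A planner along those lines might look like this. A Python sketch over (size, used) tuples; a real tool would read block group items via the btrfs tree-search ioctl, and `reserve_bgs` is a hypothetical parameter for the minimum number of block groups' worth of free space to keep:

```python
def plan_least_used_balance(block_groups, reserve_bgs=3):
    """block_groups: list of (size, used) byte counts for metadata bgs.
    Relocate least-used bgs first. Relocating a bg consumes its `used`
    bytes of free space in other bgs and removes the bg's own free
    space, so free metadata space shrinks by `size` per relocation;
    stop before it would drop below reserve_bgs block groups' worth.
    Returns the indices of block groups to balance."""
    if not block_groups:
        return []
    bg_size = block_groups[0][0]
    total_free = sum(size - used for size, used in block_groups)
    plan = []
    # Walk bgs from least used to most used.
    for i in sorted(range(len(block_groups)), key=lambda i: block_groups[i][1]):
        size, used = block_groups[i]
        if total_free - size < reserve_bgs * bg_size:
            break  # enough balancing done; keep the reserve intact
        total_free -= size
        plan.append(i)
    return plan
```

The point of the early stop is exactly the one made above: the tool does the right amount of balancing and no more, instead of compacting everything and deleting the reserve that a full filesystem will later need.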