Extent tree v2 - Githubissues

josefbacik commented 3 years ago

Existing extent tree

The existing extent tree has a lot of big advantages, but some other disadvantages.

Advantages

Every block is tracked. We can easily (relatively speaking) find out who points at a particular data extent or metadata block.
Reference counting is straightforward. We insert a reference for every allocation, we delete a reference for every removal, when the reference counter hits 0 we free the extent to be re-used for allocation.
Snapshot reference counting is relatively low impact. We only modify references on COW, and thus creating snapshots is a O(1) operation.

However there are some disadvantages that are starting to become more and more of a problem. We also have some other things that are made much worse by the fact that we didn't design a system with the set of features we have today, so I would like to take this opportunity to rethink a few things related to the extent tree in order to set ourselves up better for the future.

Problems that I want to solve

The extent tree is recursive. Because we track every block, modifying the extent tree creates new references which could potentially create new blocks. In practice there is a steady state, however this makes things like ENOSPC handling very complicated, as we have to reserve a lot of space up front in order to make sure we'll be able to make our reservations.
The extent tree doesn't scale well to multiple modifiers. I wrote patches a year ago to cut down on the amount of places we run delayed refs, because we could get a bunch of threads trying to run delayed refs and they'd all hit lock contention on the root node. We could be better about our locking here, but in the end this is a single tree for the entire file system. Breaking this out into per-block group would make a huge impact in our parallelism.
The csum tree is a huge source of lock contention. If you spin up a workload that bombards the system with writing to many files, you'll see latencies for finish_ordered_extent go up because of the threads getting held up in lock contention on the csum tree. Latencies in finish_ordered_extent can cause latencies on anything that needs to wait for ordered extents, like the ENOSPC flushing code or fsync.
Relocation. Right now relocation is a huge unwieldily beast because we have to create snapshots for everything, and then modify all paths to relocated blocks at once. This generates a lot of delayed refs which can easily push transaction commits into minutes long stalls. This particular thing could be fixed without changing the extent tree, but it's a fragile scary piece of code that's prone to errors.
Bookend extents. This is explained here in detail, but basically because we track references to full extents, we're unable to reclaim space that we're no longer referring to because we only reference the entire extent, and have no way to split the extents and free unused space.
Qgroups. Right now the only way to track space is to do a backreference lookup for every single block that we modify. This is hugely expensive, and is a problem mostly because of how we do our reference counting. If a data block only has the original root id reference we have to do a huge lookup to see if there are any roots that point to it from higher up the backreference chain. This could be avoided if we kept track of the amount of space referenced in roots by default, eliminating the need for any backreference lookups.
RAID5/6. @morbidrsa has a design doc for a stripe-tree to remove the on disk allocation out of the extent tree. As it stands now we use the extent tree to allocate the actual on disk space, and that is just an offset into a block group, and this maps directly to the physical space. Instead if we push the on disk translation down into a lower layer we can translate the logical address from the extent tree into any physical location on disk, and this allows us to be a lot more flexible with our on disk allocation strategies.

Uh, so how does changing the extent tree fix all these problems?

Firstly, it doesn't, at least not all of the problems. I'm wrapping all of these changes under the label "extent tree v2" because it's the biggest part, but it's actually going to be a series of on disk changes. However I do not want to do these things piecemeal because it'll hamper adoption. There needs to be a clear line in the sand where you choose "extent tree v2" and you get all these disk format changes moving forward. This will drastically simplify our testing matrix, as right now we have a lot of "features" that require toggling on and thus we have a lot of different features that can be mixed and matched. This will be a lot of changes, but it will make our lives easier going forward. My plan is to make the following specific changes

Make the extent tree per-block group. This addresses the parallelism problem.
Only track data extents and shared metadata blocks in the extent tree. We no longer track cow-only trees in the extent tree, we no longer track non-shared FS tree blocks in the extent tree.
Introduce a stripe tree to translate between logical addresses and physical addresses. This is me blatantly stealing @morbidrsa's idea for my own reasons, but will allow him to do his stuff as well.
Track direct and shared bytes per root. This will allow us to get qgroup numbers for free without backreference lookups.
Change the reference counting rules for data extents such that they only point at their referenced extents. This does away with the bookend extent behavior and allows us to reclaim space on fragmented files.
Make the free space tree and the csum tree per-block group. This is for the same reason as making the extent tree per-block group, it will increase our parallelism.
Change relocation to operate on a stripe level instead of logical level. Relocation has to be super careful with it's reference counting, and can actually explode metadata usage with lots of snapshots. In addition the more snapshots you have the longer it takes. What we really care about is being able to move extents out of a block group. We can do this with the stripe tree, we simply remap the physical extents to a new location and copy the extent to the new location. Relocation becomes a matter of just copying the extents and remapping their physical location. We go at the same speed no matter how many snapshots there are.

This all sounds great, are there any downsides?

Scrub

The biggest downside that I can think of right now is scrub. Right now scrub is relatively straightforward, it just marks a block group as read only, finds all the extents in that block group, reads them because you can find the owner easily, and bam you're done. With the extent tree reference counting no longer including information about cowonly trees and non-shared fs blocks we'd have to read all of the trees by searching down them. This means that we would have to keep a cache of fs tree's blocks that we've read in order to make sure we don't scrub the same tree's over and over again. This also makes it tricky because we can no longer just mark a block group read only while we do our work.

This isn't a huge drawback, data would still work the same, and we can simply search the commit roots and skip any blocks that are younger than the generation that we started the scrub with. It will be more memory to track where we've been, but the overall complexity should be relatively minimal.

Backreferences

We would no longer be able to lookup who owns arbitrary metadata blocks. I don't believe this to be a big issue because

We will no longer require backreference walking for metadata blocks for qgroups.
We will still have the ability to find all roots pointing to a specific data extent. I think the only thing this affects is something like scrub, but that's because before we'd read arbitrary locations and use the reference to know which tree it belonged to. With the new scheme we'd be searching down from the tree root, so we'd know which tree we were in, and thus could print the correct information.

The specifics

A block group tree

As indicated in a comment below, our block group items are spread all over the disk and makes mount times a big problem with really large drives. Fix this by making a tree to hold all the block group items. That's it, its pretty straightforward, no other real changes here.

Per-block group trees

This will be easy enough, we know the logical offset of the bytenr we care about, we can find the block group, and thus will be able to lookup the corresponding per-bg root for whatever operation we're doing.

Track only shared blocks and data in the extent tree - CHANGING DRASTICALLY, SEE COMMENT ABOUT DROP TREES

This is the big scary thing. You can find a description of how references currently work here and here. I'm not proposing changing the items as they stand right now, simply the rules around how we change them.

We no longer add extent references for cowonly trees. This is simple and straightforward, we just don't update the extent tree with those references. These blocks only get a ref added on allocation and it removed on deletion, there's no special rules for them.

We no longer add extent references for non-shared fs tree blocks. This is where things get a little tricky. A non-snapshotted fs tree will have 0 entries for its metadata in the extent tree. In practice a non-snapshotted fs tree acts the same as a cowonly tree, we add a ref on allocation, delete it on deletion. The trick comes into play when we snapshot. The same rules will apply as they always have, with a slight change if we are COW'ing down from the owner root.

                  +-------+         +-------+
                  |       |         |       |
                  |   A   |         |   A'  |
           +------+------++---------+-+-----+--------+
           |             |            |              |
           |             |            |              |
           |             |            |              |
           |             |            |              |
           |             |            |              |
           |             |            |              |
         +-v-----+    +--v----+     +-v-----+     +--v----+
         |       |    |       |     |       |     |       |
         |       |    |       |     |       |     |       |
     +---+-------+ +--+-------+ +---+-------++    +-----+-+------------+
     |             |            |            |          |              |
     |             |            |            |          |              |
     |             |            |            |          |              |
     |             |            |            |          |              |
+----v--+     +----v--+    +----v--+     +---v---+   +--v----+     +---v---+
|       |     |       |    |       |     |       |   |       |     |       |
|       |     |       |    |       |     |       |   |       |     |       |
+-------+     +-------+    +-------+     +-------+   +-------+     +-------+

We will have a normal reference at everything at level 1 for A' but not for A. We'll cover two full examples, first cow'ing from A and then A', then the reverse.

COW from A

Cow the root, this works normally, no extent references modified.
Cow block at level 1.
1. generation < snapshot generation, look up extent reference count.
2. The extent reference count is 1 because of A', we add 1 to this because btrfs_header_owner() == root.
3. btrfs_header_owner() == root and refcount > 1, btrfs_inc_ref(full_backref == 1), set extent flags for this block to FULL_BACKREF.
4. We are at 1 level below root, free our reference to this block for A.
Cow block at level 0.
1. generation < snapshot generation, look up extent reference count.
2. The extent reference count is 1 because of the FULL_BACKREF from level 1. Add 1 to this because btrfs_header_owner() == root as this is implied.
3. btrfs_header_owner() == root and refcount > 1, btrfs_inc_ref(full_backref == 1), set extent flags for this block to FULL_BACKREF.

COW from A'

Cow the root, this works normally, no extent references modified.
Cow block at level 1.
1. generation < snapshot generation, look up extent reference count.
2. Reference count is 1 and FULL_BACKREF is set, btrfs_inc_ref(full_backref == 0) so we push down our normal ref to the children, btrfs_dec_ref(full_backref == 1) so that the full backref is removed from the children.
3. Free our ref for this node, count is now 0, the extent is freed.
Cow block at level 0.
1. generation < snapshot generation, look up extent reference count.
2. The extent reference count is 1, FULL_BACKREF is set so we btrfs_inc_ref(full_backref == 0) to push our normal reference to the children, btrfs_dec_ref(full_backref == 1) to remove the full backref reference for the children.
3. We remove our normal reference for this block the block is freed.

And now for the other way, starting with a cow down from A'

COW from A'

Cow the root, this works normally, no extent references modified.
Cow block at level 1.
1. generation < snapshot generation, lookup extent reference count.
2. We have our reference which is 1, root_objectid != btrfs_header_owner() and FULL_BACKREF isn't set, there's an implied reference for the original owner, which makes our reference count 2.
3. We are not the owner so we btrfs_inc_ref(full_backref == 0) in order to push our reference down to all the children.
4. We free our reference to this block.
Cow block at level 0.
1. generation < snapshot generation, lookup extent reference count.
2. We have our reference from the inc_ref before which is 1, root_objectid != btrfs_header_owner() and FULL_BACKREF isn't set, there's an implied reference for the original owner, which makes our reference count 2.
3. We are not the owner so we btrfs_inc_ref(full_backref == 0) in order to push our reference down to all the file extents.
4. We free our reference to this block.

COW from A

Cow the root, this works normally, no extent references modified.
Cow block at level 1.
1. generation < snapshot generation, lookup extent reference count.
2. There is no extent reference, which means this block isn't shared anymore, don't do anything with the extent references.
3. Free the extent to be re-used.
Cow block at level 0.
1. generation < snapshot generation, lookup extent reference count.
2. There is no extent reference, which means this block isn't shared anymore, don't do anything with the extent references.

The implied reference from the original owner is somewhat tricky, so the logic in update_for_cow() would need to be updated to account for these rules, which are simply

If root_objectid != btrfs_header_owner() && extent flags != FULL_BACKREF, implied reference.
If root_objectid == btrfs_header_owner() && reference exists in the extent tree, we need to convert to FULL_BACKREF and push down as we COW.

Data extent references

Data extent references need to continue to work as before, as we have more complicated operations we can do with data, such as clone. The only change here is we no longer do bookend extents. Currently the way we handle writing to the middle of an existing file extent is this (this is the normal non-compressed/encrypted case)

Split the file extent item.
The right side has offset == 0, and has it's num_bytes adjusted to reflect the new size we reference.
The left side has offset == the end of the split, and we inc a ref to our disk_bytenr because of the new item, using original offset (which is key.offset - file_item->offset) as part of the key.
Insert the new file extent that points to the new extent.

The space that is no longer referenced that exists in the slot where the new file extent was placed is now wasted, as it will not be free'd until the left and right side of the extents are eventually freed.

The new scheme will be the following.

Free the reference for the original file extent item.
Split the file extent item.
Add a reference to the left section of the split file extent, the file extent disk_bytenr and disk_num_bytes will be modified to reflect the actual new size of the file extent, offset remains 0.
Add a reference to the right section of the split file extent, the file extent disk_bytenr and disk_num_bytes will be modified to reflect the actual new size of the file extent, offset remains 0.
Add the new file extent.

The space in the area that the new file extent replaces will be freed to be re-allocated again.

The compression case will remain the same, as we have to have the entire compressed extent to extract the area the file extent points to, but this isn't as bad because we limit our compressed extent sizes to 128k, thus we can only waste 128k-4k worth of space per extent at any given time because of bookend extents, compared to 128m-4k worth of wasted space per extent with the normal case.

Stripe tree - PROBABLY NOT GOING TO DO THIS

EDIT: What I want to accomplish and what @morbidrsa want to accomplish are slightly different things, and trying to combine them will remove flexibility for both of us. I'm still going to tackle relocation, but this idea is being set aside.

This is fairly straightforward, it'll track the phyisical location of the logical address space. Traditionally this was just some math, we had a chunk that mapped the physical offset of a block group, and so we would just take the offset from logical address from the block group and use that offset off of the physical start of the block group.

We would replace this with a stripe tree that actually tracked physical locations for the logical offset, so it could be any arbitrary device and physical offset within that device. This would have the following items, again blatantly ripped off from @morbidrsa with some modifications for my uses as well

struct btrfs_stripe_extent {
        /* btrfs device-id this raid extent lives on */
        __le64 devid;
        /* offset from the devextent start */
        __le64 offset;
} __attribute__ ((__packed__));

key = {
        /* This is the logical address */
        .objectid = logical;
        .type = BTRFS_STRIPE_ITEM;
        .offset = size;
};

struct btrfs_dp_stripe {
        /* Original logical offset, if we were relocated. */
        __le64 original_logical_offset;
        /* How many extents are packed in here */
        __le32 nr_extents;
        /* Reseved for usage later/padding. */
        __le32 resv;
        /* array of RAID stripe extents this stripe is
         * comprised of
         */
        struct btrfs_stripe_extent extents[];
} __attribute__ ((__packed__));

key = {
        .objectid = logical;
        /* This would be less than BTRFS_STRIPE_ITEM so we could find it first. */
        .type = BTRFS_RELOCATION_ITEM;
        .offset = size;
};

/* This would be present if we had remapped a logical extent due to relocation. */
struct btrfs_relocation_extent {
        /* logical address of the new logical address */
       __le64 logical_offset;
};

Relocation - CHANGING FROM MY ORIGINAL IDEA

EDIT: Since I'm not doing the stripe tree I want to handle this in a different way. My plan is to do something sort of like what is described below, but instead make a new REMAP tree. If we relocate a block group we will set a REMAP flag on it's block group flags (maybe, I have to see if I can actually set the flags to something other than data type), and then populate the REMAP tree with where I've relocated the extents inside the block group. On mount this gets loaded up and we will translate any IO to the new logical offset where the extent resides. Once all of the extents have been freed from the block group the remap items will be deleted and the block group itself will be deleted. Of course the chunk will have been reclaimed by the time all of the blocks are remapped in the block group, so the space will be available, just the accounting will be removed later.

The new relocation behavior would be to migrate block groups as normal, but instead of walking the extent tree we would walk the stripe tree finding all stripes that exist in our logical space. We would then go find gaps in other logical areas that could host our stripe, read in the bytes from a btrfs_stripe_extent, and then write them to a new stripe and update the on disk stripe for the extent. We would keep track of the in memory mapping in an extent_map, so the operation would look something like this

Mark the block group read only.
Look up a btrfs_dp_stripe in that block group.
1. If we find a btrfs_dp_stripe then this is our first relocation.
2. If we find a btrfs_relocation_extent then this is our Nth relocation, pull the logical offset out and look up that dp_stripe.
Allocate a new extent in a different block group to give us a new logical offset.
Read in data, write data to the new logical offset.
Update the location of the new logical offset.
1. If the relocation_extent didn't already exist, insert a new one to point at the new logical extent.
2. If it did exist, update the logical extent.
Free the stripe.

Direct and shared bytes tracking per root - PROBABLY NOT GOING TO DO THIS

EDIT: We still have to do lookups after the fact to figure out if a shared extent went from N to 1 references. And we can likely accomplish this behavior with the current qgroup code, just would require some changes to runtime tracking and we don't need to do it on disk.

There is one tricky problem with qgroups, and that is the conversion of a data extent from shared to exclusive. This is tricky because it requires a full backref lookup everytime we modify an extent so that we can determine if we need to convert the file extent bytes from shared to exclusive. There is not much that can be done about this problem unfortunately, however we can make the intermediate tracking much simpler by storing the current shared and exclusive counts in the root items themselves. This would work like the following

1) Every metadata allocation is exclusive, automatically added to the root when it happens. 2) At snapshot time we do the following

orig_root->shared += orig_root->exclusive - nodesize; 
orig_root->exclusive = nodesize;
snap->shared = orig_root->shared;
snap->exlclusive = nodesize;

3) At COW time if we're shared we subtract our ->shared, the ->exclusive gets changed when we allocate a new block for the cow. 4) If the reference count went to 1 we know the root that points at us, or we have a fullbackref. If we have the root we go ahead and convert right there. If we have a fullbackref we mark the block to be looked up.

This will reduce the complexity of tracking these counters across the board, and reduce the amount of backref lookups we have to do.

josefbacik commented 2 years ago

I'd like to ask something that it's hopefully a bit more relevant: right now enabling quotas with lots of snapshots hangs the whole system for minutes, is it a problem extent tree v2 is going to alleviate?

Yes, there's a lot of reasons for this, and I'm addressing all of them (hopefully?)

qgroups relies heavily on counting the number of references to every individual extent that's changed in a transaction before and after that transaction. This involves doing a backref lookup for the current state of the fs, and again for the previous state. The backref lookup for us is not cheap at all. With the snapshot tree you can take your original subvol + snapshot id and O(1) find who could reference a bytenr. What we lose is the ability to easily tell who currently references us. Which brings us to...
All of this calculation has to be done in the transaction commit critical section, because we can't let anything change while we're looking all of this up. This delays transaction commits, which delays anything that may need to wait on a transaction commit. The garbage collection tree is the solution to this. We'll be async'ing the updates to the qgroup counters. Adding bytes to the counter isn't the hard part, it's going from N->1 and 1->0. N->1 is hard because that's where we have to add the counter to "exclusive" on the remaining reference. 1->0 is less tricky, we know who just free'd us so we can just drop the counter. With the garbage collection tree we'll be adding items to post-process to walk through these transitions and update the counters. This eliminates all of the latency inducing terribleness, because now it's 1 trans handle per update, and it's done outside of the critical stage. However this does induce some lag on the counters when removing things. The plan for that is if you are at your qgroup limit and you rm -rf things to make room, you'll get latency here for any operations on this qgroup because we'll be forced to wait for these async operations to complete. However this is purely latency on any operations inside that qgroup, it won't affect the rest of the file system.

Augusto7743 commented 2 years ago

Thanks very much for all improvements. Not demanding. What is date for release an working extent tree v2 in kernel ? Only to understand when reformat the disk and also if current code is stable I can use for testing in an disk only used for OS files ?

darkbasic commented 1 year ago

The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time.

It looks like squota won't be the magic bullet we where hoping for. I'm curious to know if extent tree v2 is still being worked on at this point.

josefbacik commented 1 year ago

The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time.

It looks like squota won't be the magic bullet we where hoping for. I'm curious to know if extent tree v2 is still being worked on at this point.

Squotas still works in this case, it just creates this awkward thing where the usage will be attributed to the deleted subvolume and you won't be able to see that. This is the example

/some/subvolume <- 100g of usage on disk, 100g of usage according to squotas
btrfs sub snap /some/subvolume /my/snap <- 100g of usage on disk, 0g of usage according to squotas
btrfs sub delete /some/subvolume
btrfs filesystem usage / <- shows 100g of usage on disk, squotas shows 0 because /some/subvolume was deleted

This is what we're referring to. Because we're not live tracking shared accounting, and we only attribute the usage to the original root, you can end up "losing track" of your actual usage. Now of course you can du -h /my/snap and see that's where the space is, we don't lie about actual usage. But if you list the usage from squotas it's only going to tell you what is actually accounted to the quota for that subvolume.

As for extent tree v2, yes I'm actively working on it. With the scope of the project the design has had to change while I was developing it and discovering flaws in certain areas. I hope to be code complete in the next couple of months.

btrfs / btrfs-todo

Extent tree v2 #25