btrfs / btrfs-todo

An issues only repo to organize our TODO items

21 stars 2 forks source link

Extent tree v2 #25

Open josefbacik opened 3 years ago

josefbacik commented 3 years ago

Existing extent tree

The existing extent tree has a lot of big advantages, but some other disadvantages.

Advantages

Every block is tracked. We can easily (relatively speaking) find out who points at a particular data extent or metadata block.
Reference counting is straightforward. We insert a reference for every allocation, we delete a reference for every removal, when the reference counter hits 0 we free the extent to be re-used for allocation.
Snapshot reference counting is relatively low impact. We only modify references on COW, and thus creating snapshots is a O(1) operation.

However there are some disadvantages that are starting to become more and more of a problem. We also have some other things that are made much worse by the fact that we didn't design a system with the set of features we have today, so I would like to take this opportunity to rethink a few things related to the extent tree in order to set ourselves up better for the future.

Problems that I want to solve

The extent tree is recursive. Because we track every block, modifying the extent tree creates new references which could potentially create new blocks. In practice there is a steady state, however this makes things like ENOSPC handling very complicated, as we have to reserve a lot of space up front in order to make sure we'll be able to make our reservations.
The extent tree doesn't scale well to multiple modifiers. I wrote patches a year ago to cut down on the amount of places we run delayed refs, because we could get a bunch of threads trying to run delayed refs and they'd all hit lock contention on the root node. We could be better about our locking here, but in the end this is a single tree for the entire file system. Breaking this out into per-block group would make a huge impact in our parallelism.
The csum tree is a huge source of lock contention. If you spin up a workload that bombards the system with writing to many files, you'll see latencies for finish_ordered_extent go up because of the threads getting held up in lock contention on the csum tree. Latencies in finish_ordered_extent can cause latencies on anything that needs to wait for ordered extents, like the ENOSPC flushing code or fsync.
Relocation. Right now relocation is a huge unwieldily beast because we have to create snapshots for everything, and then modify all paths to relocated blocks at once. This generates a lot of delayed refs which can easily push transaction commits into minutes long stalls. This particular thing could be fixed without changing the extent tree, but it's a fragile scary piece of code that's prone to errors.
Bookend extents. This is explained here in detail, but basically because we track references to full extents, we're unable to reclaim space that we're no longer referring to because we only reference the entire extent, and have no way to split the extents and free unused space.
Qgroups. Right now the only way to track space is to do a backreference lookup for every single block that we modify. This is hugely expensive, and is a problem mostly because of how we do our reference counting. If a data block only has the original root id reference we have to do a huge lookup to see if there are any roots that point to it from higher up the backreference chain. This could be avoided if we kept track of the amount of space referenced in roots by default, eliminating the need for any backreference lookups.
RAID5/6. @morbidrsa has a design doc for a stripe-tree to remove the on disk allocation out of the extent tree. As it stands now we use the extent tree to allocate the actual on disk space, and that is just an offset into a block group, and this maps directly to the physical space. Instead if we push the on disk translation down into a lower layer we can translate the logical address from the extent tree into any physical location on disk, and this allows us to be a lot more flexible with our on disk allocation strategies.

Uh, so how does changing the extent tree fix all these problems?

Firstly, it doesn't, at least not all of the problems. I'm wrapping all of these changes under the label "extent tree v2" because it's the biggest part, but it's actually going to be a series of on disk changes. However I do not want to do these things piecemeal because it'll hamper adoption. There needs to be a clear line in the sand where you choose "extent tree v2" and you get all these disk format changes moving forward. This will drastically simplify our testing matrix, as right now we have a lot of "features" that require toggling on and thus we have a lot of different features that can be mixed and matched. This will be a lot of changes, but it will make our lives easier going forward. My plan is to make the following specific changes

Make the extent tree per-block group. This addresses the parallelism problem.
Only track data extents and shared metadata blocks in the extent tree. We no longer track cow-only trees in the extent tree, we no longer track non-shared FS tree blocks in the extent tree.
Introduce a stripe tree to translate between logical addresses and physical addresses. This is me blatantly stealing @morbidrsa's idea for my own reasons, but will allow him to do his stuff as well.
Track direct and shared bytes per root. This will allow us to get qgroup numbers for free without backreference lookups.
Change the reference counting rules for data extents such that they only point at their referenced extents. This does away with the bookend extent behavior and allows us to reclaim space on fragmented files.
Make the free space tree and the csum tree per-block group. This is for the same reason as making the extent tree per-block group, it will increase our parallelism.
Change relocation to operate on a stripe level instead of logical level. Relocation has to be super careful with it's reference counting, and can actually explode metadata usage with lots of snapshots. In addition the more snapshots you have the longer it takes. What we really care about is being able to move extents out of a block group. We can do this with the stripe tree, we simply remap the physical extents to a new location and copy the extent to the new location. Relocation becomes a matter of just copying the extents and remapping their physical location. We go at the same speed no matter how many snapshots there are.

This all sounds great, are there any downsides?

Scrub

The biggest downside that I can think of right now is scrub. Right now scrub is relatively straightforward, it just marks a block group as read only, finds all the extents in that block group, reads them because you can find the owner easily, and bam you're done. With the extent tree reference counting no longer including information about cowonly trees and non-shared fs blocks we'd have to read all of the trees by searching down them. This means that we would have to keep a cache of fs tree's blocks that we've read in order to make sure we don't scrub the same tree's over and over again. This also makes it tricky because we can no longer just mark a block group read only while we do our work.

This isn't a huge drawback, data would still work the same, and we can simply search the commit roots and skip any blocks that are younger than the generation that we started the scrub with. It will be more memory to track where we've been, but the overall complexity should be relatively minimal.

Backreferences

We would no longer be able to lookup who owns arbitrary metadata blocks. I don't believe this to be a big issue because

We will no longer require backreference walking for metadata blocks for qgroups.
We will still have the ability to find all roots pointing to a specific data extent. I think the only thing this affects is something like scrub, but that's because before we'd read arbitrary locations and use the reference to know which tree it belonged to. With the new scheme we'd be searching down from the tree root, so we'd know which tree we were in, and thus could print the correct information.

The specifics

A block group tree

As indicated in a comment below, our block group items are spread all over the disk and makes mount times a big problem with really large drives. Fix this by making a tree to hold all the block group items. That's it, its pretty straightforward, no other real changes here.

Per-block group trees

This will be easy enough, we know the logical offset of the bytenr we care about, we can find the block group, and thus will be able to lookup the corresponding per-bg root for whatever operation we're doing.

Track only shared blocks and data in the extent tree - CHANGING DRASTICALLY, SEE COMMENT ABOUT DROP TREES

This is the big scary thing. You can find a description of how references currently work here and here. I'm not proposing changing the items as they stand right now, simply the rules around how we change them.

We no longer add extent references for cowonly trees. This is simple and straightforward, we just don't update the extent tree with those references. These blocks only get a ref added on allocation and it removed on deletion, there's no special rules for them.

We no longer add extent references for non-shared fs tree blocks. This is where things get a little tricky. A non-snapshotted fs tree will have 0 entries for its metadata in the extent tree. In practice a non-snapshotted fs tree acts the same as a cowonly tree, we add a ref on allocation, delete it on deletion. The trick comes into play when we snapshot. The same rules will apply as they always have, with a slight change if we are COW'ing down from the owner root.

                  +-------+         +-------+
                  |       |         |       |
                  |   A   |         |   A'  |
           +------+------++---------+-+-----+--------+
           |             |            |              |
           |             |            |              |
           |             |            |              |
           |             |            |              |
           |             |            |              |
           |             |            |              |
         +-v-----+    +--v----+     +-v-----+     +--v----+
         |       |    |       |     |       |     |       |
         |       |    |       |     |       |     |       |
     +---+-------+ +--+-------+ +---+-------++    +-----+-+------------+
     |             |            |            |          |              |
     |             |            |            |          |              |
     |             |            |            |          |              |
     |             |            |            |          |              |
+----v--+     +----v--+    +----v--+     +---v---+   +--v----+     +---v---+
|       |     |       |    |       |     |       |   |       |     |       |
|       |     |       |    |       |     |       |   |       |     |       |
+-------+     +-------+    +-------+     +-------+   +-------+     +-------+

We will have a normal reference at everything at level 1 for A' but not for A. We'll cover two full examples, first cow'ing from A and then A', then the reverse.

COW from A

Cow the root, this works normally, no extent references modified.
Cow block at level 1.
1. generation < snapshot generation, look up extent reference count.
2. The extent reference count is 1 because of A', we add 1 to this because btrfs_header_owner() == root.
3. btrfs_header_owner() == root and refcount > 1, btrfs_inc_ref(full_backref == 1), set extent flags for this block to FULL_BACKREF.
4. We are at 1 level below root, free our reference to this block for A.
Cow block at level 0.
1. generation < snapshot generation, look up extent reference count.
2. The extent reference count is 1 because of the FULL_BACKREF from level 1. Add 1 to this because btrfs_header_owner() == root as this is implied.
3. btrfs_header_owner() == root and refcount > 1, btrfs_inc_ref(full_backref == 1), set extent flags for this block to FULL_BACKREF.

COW from A'

Cow the root, this works normally, no extent references modified.
Cow block at level 1.
1. generation < snapshot generation, look up extent reference count.
2. Reference count is 1 and FULL_BACKREF is set, btrfs_inc_ref(full_backref == 0) so we push down our normal ref to the children, btrfs_dec_ref(full_backref == 1) so that the full backref is removed from the children.
3. Free our ref for this node, count is now 0, the extent is freed.
Cow block at level 0.
1. generation < snapshot generation, look up extent reference count.
2. The extent reference count is 1, FULL_BACKREF is set so we btrfs_inc_ref(full_backref == 0) to push our normal reference to the children, btrfs_dec_ref(full_backref == 1) to remove the full backref reference for the children.
3. We remove our normal reference for this block the block is freed.

And now for the other way, starting with a cow down from A'

COW from A'

Cow the root, this works normally, no extent references modified.
Cow block at level 1.
1. generation < snapshot generation, lookup extent reference count.
2. We have our reference which is 1, root_objectid != btrfs_header_owner() and FULL_BACKREF isn't set, there's an implied reference for the original owner, which makes our reference count 2.
3. We are not the owner so we btrfs_inc_ref(full_backref == 0) in order to push our reference down to all the children.
4. We free our reference to this block.
Cow block at level 0.
1. generation < snapshot generation, lookup extent reference count.
2. We have our reference from the inc_ref before which is 1, root_objectid != btrfs_header_owner() and FULL_BACKREF isn't set, there's an implied reference for the original owner, which makes our reference count 2.
3. We are not the owner so we btrfs_inc_ref(full_backref == 0) in order to push our reference down to all the file extents.
4. We free our reference to this block.

COW from A

Cow the root, this works normally, no extent references modified.
Cow block at level 1.
1. generation < snapshot generation, lookup extent reference count.
2. There is no extent reference, which means this block isn't shared anymore, don't do anything with the extent references.
3. Free the extent to be re-used.
Cow block at level 0.
1. generation < snapshot generation, lookup extent reference count.
2. There is no extent reference, which means this block isn't shared anymore, don't do anything with the extent references.

The implied reference from the original owner is somewhat tricky, so the logic in update_for_cow() would need to be updated to account for these rules, which are simply

If root_objectid != btrfs_header_owner() && extent flags != FULL_BACKREF, implied reference.
If root_objectid == btrfs_header_owner() && reference exists in the extent tree, we need to convert to FULL_BACKREF and push down as we COW.

Data extent references

Data extent references need to continue to work as before, as we have more complicated operations we can do with data, such as clone. The only change here is we no longer do bookend extents. Currently the way we handle writing to the middle of an existing file extent is this (this is the normal non-compressed/encrypted case)

Split the file extent item.
The right side has offset == 0, and has it's num_bytes adjusted to reflect the new size we reference.
The left side has offset == the end of the split, and we inc a ref to our disk_bytenr because of the new item, using original offset (which is key.offset - file_item->offset) as part of the key.
Insert the new file extent that points to the new extent.

The space that is no longer referenced that exists in the slot where the new file extent was placed is now wasted, as it will not be free'd until the left and right side of the extents are eventually freed.

The new scheme will be the following.

Free the reference for the original file extent item.
Split the file extent item.
Add a reference to the left section of the split file extent, the file extent disk_bytenr and disk_num_bytes will be modified to reflect the actual new size of the file extent, offset remains 0.
Add a reference to the right section of the split file extent, the file extent disk_bytenr and disk_num_bytes will be modified to reflect the actual new size of the file extent, offset remains 0.
Add the new file extent.

The space in the area that the new file extent replaces will be freed to be re-allocated again.

The compression case will remain the same, as we have to have the entire compressed extent to extract the area the file extent points to, but this isn't as bad because we limit our compressed extent sizes to 128k, thus we can only waste 128k-4k worth of space per extent at any given time because of bookend extents, compared to 128m-4k worth of wasted space per extent with the normal case.

Stripe tree - PROBABLY NOT GOING TO DO THIS

EDIT: What I want to accomplish and what @morbidrsa want to accomplish are slightly different things, and trying to combine them will remove flexibility for both of us. I'm still going to tackle relocation, but this idea is being set aside.

This is fairly straightforward, it'll track the phyisical location of the logical address space. Traditionally this was just some math, we had a chunk that mapped the physical offset of a block group, and so we would just take the offset from logical address from the block group and use that offset off of the physical start of the block group.

We would replace this with a stripe tree that actually tracked physical locations for the logical offset, so it could be any arbitrary device and physical offset within that device. This would have the following items, again blatantly ripped off from @morbidrsa with some modifications for my uses as well

struct btrfs_stripe_extent {
        /* btrfs device-id this raid extent lives on */
        __le64 devid;
        /* offset from the devextent start */
        __le64 offset;
} __attribute__ ((__packed__));

key = {
        /* This is the logical address */
        .objectid = logical;
        .type = BTRFS_STRIPE_ITEM;
        .offset = size;
};

struct btrfs_dp_stripe {
        /* Original logical offset, if we were relocated. */
        __le64 original_logical_offset;
        /* How many extents are packed in here */
        __le32 nr_extents;
        /* Reseved for usage later/padding. */
        __le32 resv;
        /* array of RAID stripe extents this stripe is
         * comprised of
         */
        struct btrfs_stripe_extent extents[];
} __attribute__ ((__packed__));

key = {
        .objectid = logical;
        /* This would be less than BTRFS_STRIPE_ITEM so we could find it first. */
        .type = BTRFS_RELOCATION_ITEM;
        .offset = size;
};

/* This would be present if we had remapped a logical extent due to relocation. */
struct btrfs_relocation_extent {
        /* logical address of the new logical address */
       __le64 logical_offset;
};

Relocation - CHANGING FROM MY ORIGINAL IDEA

EDIT: Since I'm not doing the stripe tree I want to handle this in a different way. My plan is to do something sort of like what is described below, but instead make a new REMAP tree. If we relocate a block group we will set a REMAP flag on it's block group flags (maybe, I have to see if I can actually set the flags to something other than data type), and then populate the REMAP tree with where I've relocated the extents inside the block group. On mount this gets loaded up and we will translate any IO to the new logical offset where the extent resides. Once all of the extents have been freed from the block group the remap items will be deleted and the block group itself will be deleted. Of course the chunk will have been reclaimed by the time all of the blocks are remapped in the block group, so the space will be available, just the accounting will be removed later.

The new relocation behavior would be to migrate block groups as normal, but instead of walking the extent tree we would walk the stripe tree finding all stripes that exist in our logical space. We would then go find gaps in other logical areas that could host our stripe, read in the bytes from a btrfs_stripe_extent, and then write them to a new stripe and update the on disk stripe for the extent. We would keep track of the in memory mapping in an extent_map, so the operation would look something like this

Mark the block group read only.
Look up a btrfs_dp_stripe in that block group.
1. If we find a btrfs_dp_stripe then this is our first relocation.
2. If we find a btrfs_relocation_extent then this is our Nth relocation, pull the logical offset out and look up that dp_stripe.
Allocate a new extent in a different block group to give us a new logical offset.
Read in data, write data to the new logical offset.
Update the location of the new logical offset.
1. If the relocation_extent didn't already exist, insert a new one to point at the new logical extent.
2. If it did exist, update the logical extent.
Free the stripe.

Direct and shared bytes tracking per root - PROBABLY NOT GOING TO DO THIS

EDIT: We still have to do lookups after the fact to figure out if a shared extent went from N to 1 references. And we can likely accomplish this behavior with the current qgroup code, just would require some changes to runtime tracking and we don't need to do it on disk.

There is one tricky problem with qgroups, and that is the conversion of a data extent from shared to exclusive. This is tricky because it requires a full backref lookup everytime we modify an extent so that we can determine if we need to convert the file extent bytes from shared to exclusive. There is not much that can be done about this problem unfortunately, however we can make the intermediate tracking much simpler by storing the current shared and exclusive counts in the root items themselves. This would work like the following

1) Every metadata allocation is exclusive, automatically added to the root when it happens. 2) At snapshot time we do the following

orig_root->shared += orig_root->exclusive - nodesize; 
orig_root->exclusive = nodesize;
snap->shared = orig_root->shared;
snap->exlclusive = nodesize;

3) At COW time if we're shared we subtract our ->shared, the ->exclusive gets changed when we allocate a new block for the cow. 4) If the reference count went to 1 we know the root that points at us, or we have a fullbackref. If we have the root we go ahead and convert right there. If we have a fullbackref we mark the block to be looked up.

This will reduce the complexity of tracking these counters across the board, and reduce the amount of backref lookups we have to do.

Conan-Kudo commented 3 years ago

RAID5/6. @morbidrsa has a design doc for a stripe-tree to remove the on disk allocation out of the extent tree. As it stands now we use the extent tree to allocate the actual on disk space, and that is just an offset into a block group, and this maps directly to the physical space. Instead if we push the on disk translation down into a lower layer we can translate the logical address from the extent tree into any physical location on disk, and this allows us to be a lot more flexible with our on disk allocation strategies.

@josefbacik, @morbidrsa: any chance I could take a peek at this design doc? I'm quite interested in reading more about this...

Zygo commented 3 years ago

I, for one, welcome our new non-bookending data extent overlords. It makes dedupe much simpler, and eliminates an entire use case for defrag (where people run defrag to get its wasted-block-discarding side-effect).

Currently the way we handle writing to the middle of an existing file extent is this (this is the normal non-compressed/encrypted case)

...and non-shared case?

Do we figure out block-by-block whether the blocks freed by this write were the last references to those blocks in the data extent? If they were, do we update all the other references to the remaining shared parts of the extent? e.g.

write(subA/fileA, 0..4M); // one data extent, 4M long
snapshot(subA, subB); // data extent is now shared
write(subB/fileA, 1M..2M); // can't delete any blocks because subA still refers to all of them
snapshot(subB, subC); // make some more refs to the shared blocks but not the deleted ones
snapshot(subB, subD); // make some more snapshots so there are refs in shared metadata pages
write(subA/fileA, 1M..2M); // does this free the blocks from 1M..2M in the data extent, rewriting references from subA, subB, subC, subD? Is there a short cut for subC and subD that likely have their refs on a shared page, so we don't unshare shared metadata?

Or do we just fall back to the old behavior when the data extent is shared (i.e. waste the blocks)?

Edit: step 3 can't delete the blocks, but it could update all the references and split the data extent into 3 pieces with different numbers of references. This spends more work in the present to eliminate the need to build a per-block extent reference count map in the future, i.e. all the blocks in any extent are always referenced the same number of times. This might also make things like prealloc and nodatacow faster, as they currently have to build these maps.

josefbacik commented 3 years ago

@Zygo In your example we'd free that 1m-2m area. Here's the step by step of how it would work.

reference for subA logical 0-4M.
No change as we don't update references until the metadata is modified. 3.a. The cow of the leaf adds a reference for subB for logical 0-4m. 3.b. We drop our reference for subB for logical 0-4m, we add a reference for subB for 0-1M, we add a reference for subB for 5m-6m (assuming we allocate sequentially of course) pointing to subB's 1m-2m chunk of that inode, add a reference for subB for 2m-4m.
Nothing changes, because again no metadata is modified so we don't update references.
Again nothing changes. 6a. Free reference for 0-4m for subA. 6.b Add reference for 0-1M for subA, add a reference for 6m-7m for the inode in subA for the 1m-2m chunk, add a reference for 2m-4m for subA. 6.c We have no reference for 1m-2m anymore, this chunk is freed.

Now say you snapshotted subB before you wrote, the yes we'd still point at the original full extent via subC and subD, but that's expected because they really would point at it.

@Conan-Kudo I'll dig it out at some point, but it's not as fleshed out as this, it's more a general proposal, basically I've watered down what he had into my stripe tree description, while leaving him the spots in the metadata to expand for his own uses in the future.

Zygo commented 3 years ago

OK, so what does the data extent look like after step 3b? Currently it would look like:

data extent 0..4m, refs 3:
- subA/fileA bytenr=0 offset=0 length=4M at inode offset 0
- subB/fileA bytenr=0 offset=0 length=1M at inode offset 0 // lower bookend
- subB/fileA bytenr=0 offset=2M length=2M at inode offset 2M // upper bookend
data extent 4m..5m, refs 1:
- subB/fileA bytenr=4m offset=0 length=1M at inode offset 1M // subB/fileA's new extent

References are currently tracked by bytenr, so when we modified subB we didn't have to change anything in subA. But that means when we get to step 6, we can't free the space because we know there are several references to the extent but don't know (or don't want to do the IO to find out) whether the blocks are freed in the middle. So after step 6 we would currently have:

data extent 0..4m, refs 4:
- subA/fileA bytenr=0 offset=0 length=1M at inode offset 0
- subA/fileA bytenr=0 offset=2M length=1M at inode offset 2M
- subB/fileA bytenr=0 offset=0 length=1M at inode offset 0
- subB/fileA bytenr=0 offset=2M length=2M at inode offset 2M
- subC and subD trees refer to pages of subB referencing the above
- no reference to data extent bytenr=0 offset=1M length=1M, these blocks are wasted
data extent 4m..5m, refs 1:
- subB/fileA bytenr=4m offset=0 length=1M at inode offset 1M
- subC and subD trees refer to pages of subB referencing the above
data extent 5m..6m, refs 1:
- subA/fileA bytenr=5m offset=0 length=1M at inode offset 1M

Would it look like this after step 3b instead:

data extent 0..1m, refs 2:
- subA/fileA bytenr=0 offset=0 length=1M at inode offset 0
- subB/fileA bytenr=0 offset=0 length=1M at inode offset 0
data extent 1m..2m, refs 1:
- subA/fileA bytenr=1m offset=0 length=1M at inode offset 1M
data extent 2m..4m, refs 2:
- subA/fileA bytenr=2m offset=0 length=2M at inode offset 2M <- subA is modified here by a write to subB
- subB/fileA bytenr=2m offset=0 length=2M at inode offset 2M
data extent 4m..5m, refs 1:
- subB/fileA bytenr=4m offset=0 length=1M at inode offset 1M // same as we do now

So we solve the wasted space problem because we can clearly see at step 6 that the extent ref count of the 1m..2m extent drops to 0, and we remove the data extent from the extent tree.

But wouldn't this require updating every reference to the data extent whenever a partial reference to the extent is freed? What am I missing here?

As I understand it, after step 5 we have:

data extent 0..1m, refs 2:
- subA/fileA bytenr=0 offset=0 length=1M at inode offset 0
- subB/fileA bytenr=0 offset=0 length=1M at inode offset 0
  - subC and subD's tree references the pages of subB above
data extent 1m..2m, refs 1:
- subA/fileA bytenr=1m offset=0 length=1M at inode offset 1M
data extent 2m..4m, refs 2:
- subA/fileA bytenr=2m offset=0 length=2M at inode offset 2M
- subB/fileA bytenr=2m offset=0 length=2M at inode offset 2M
  - subC and subD's tree references the pages of subB above
data extent 4m..5m, refs 1:
- subB/fileA bytenr=4m offset=0 length=1M at inode offset 1M // same as we do now

so OK that's fine, as long as we are pruning the bookend extents as we go. Not very interesting.

What if we added step 7 at the end:

write(subA/fileA, 3m..4m)

Will that force an update of subB, subC, and subD to match subA's new extent boundaries?

data extent 0..1m, refs 2:
- subA/fileA bytenr=0 offset=0 length=1M at inode offset 0
- subB/fileA bytenr=0 offset=0 length=1M at inode offset 0
- subC and subD can refer to pages of subB here, nothing changed
free space 1m..2m, refs 0:
- this is now unreferenced free space
data extent 2m..3m, refs 4:
- subA/fileA bytenr=2m offset=0 length=1M at inode offset 2M
- subB/fileA bytenr=2m offset=0 length=1M at inode offset 2M // write to subA modifies refs in subB
- subC/fileA bytenr=2m offset=0 length=1M at inode offset 2M // write to subA modifies refs in subC
- subD/fileA bytenr=2m offset=0 length=1M at inode offset 2M // this metadata page was shared, now it's not?
data extent 3m..4m, refs 3:
- subB/fileA bytenr=3m offset=0 length=1M at inode offset 3M // no longer referenced by subA so we modify subB
- subC/fileA bytenr=3m offset=0 length=1M at inode offset 3M // page may have been shared by subB, subC, and subD
- subD/fileA bytenr=3m offset=0 length=1M at inode offset 3M // now suppose there's 100 of these snapshots...
data extent 4m..5m, refs 1:
- subB/fileA bytenr=4m offset=0 length=1M at inode offset 1M // same as we do now
data extent 5m..6m, refs 1:
- subA/fileA bytenr=5m offset=0 length=1M at inode offset 1M // same as we do now
data extent 6m..7m, refs 1:
- subA/fileA bytenr=6m offset=0 length=1M at inode offset 3M // same as we do now

josefbacik commented 3 years ago

This got confusing with all the offsets, and it looks very different from what we do now, I'll write it your way

data extetnt 0m-4m, ref 1
- subA/fileA bytenr=0m offset=0 length=4M at inode offset 0m, 1 ref
No change, no metadata update.
Two references are added for the new bytenr, the original for subB/fileA is split.
- subA/fileA bytenr=0m offset=0 length=4m at inode offset 0m, 1 ref
- subB/fileA bytenr=0m offset=0 length=1m at inode offset 0m, 1 ref
- subB/fileA bytenr=5m offset=0 length=1m at inode offset 1m, 1 ref
- subB/fileA bytenr=2m offset=0 length=2m at inode offset 2m, 1 ref
Again no change, there was no metadata upate.
We split the original big ref for A, drop the part that's no longer referenced.
- subA/fileA bytenr=0 offset=0 length=4m at inode offset 0m, 0 ref // This is the first thing we drop
- subA/fileA bytenr=0 offset=0 length=1m at inode offset 0m, 1 ref
- subA/fileA bytenr=6m offset=0 length=1m at inode offset 1m, 1 ref
- subA/fileA bytenr=2m offset=0 length=2m at inode offset 2m, 2 ref // one for subA one for subB
- subB's stuff

Before you would never have extent references overlapping in the extent tree. I think that's what you're missing here, I would allow extent ranges to overlap, because every file extent will reference the exact length of the real extent ref they hold in the extent tree. This means you could end up with something wonky like this

[0-65m] [0-1m] [5m-65m] [1m-5m] [55m-128m]

If I drop [1m-5m] we immediately get that space freed (well "immediately", it'll be pinned of course).

This does make free'ing waaaaay more complicated of course. Because if we do something like

[0-4m] [1m-2m] [3m-4m]

and we free [0-4m] we have to go and see if we have any references in that range. I will handle this by using our range locking to lock the extent we're messing with. So I'll lock [0-4m], then look to see if I have any overlaps. If there's any range that gets freed it'll get added to the pinned thing and we'll carry on.

There will clearly be a very robust self-test written for this code, because there will be plenty of dark corners.

Zygo commented 3 years ago

I would allow extent ranges to overlap

Ah, indeed that is the missing piece. Then for free space reclaim we just look at the bytenr range covered by the data extent that hit 0 references, and free any blocks that are not overlapped by another data extent, i.e. we don't need to look at any subvol trees, because everything we need to know about which blocks are referenced is available in the extent tree (though possibly up to 128M-4K before the freed item, so it could be a few hundred pages away in Dave-Chinner-style test cases).

Thanks, I think I see how it works now.

josefbacik commented 3 years ago

Block group tree

(I forgot to add this to my design doc, I will edit it once I'm at a real computer).

We load all of the block groups on mount, which we don't necessarily have to do, but changing it could be annoying. At any rate all of the block groups are indexed by bytenr, so they're spread all out on the disk. This makes mounting very large file systems kind of slow. Fix this by adding a new tree dedicated to the block groups, it'll simple have block group items in it and they'll look exactly like they do today. This way they'll be all closely located to each other and we'll be able to mount a lot faster.

fdmanana commented 3 years ago

Existing extent tree

The existing extent tree has a lot of big advantages, but some other disadvantages.

Advantages

Every block is tracked. We can easily (relatively speaking) find out who points at a particular data extent or metadata block.

Reference counting is straightforward. We insert a reference for every allocation, we delete a reference for every removal, when the reference counter hits 0 we free the extent to be re-used for allocation.

Snapshot reference counting is relatively low impact. We only modify references on COW, and thus creating snapshots is a O(1) operation.

However there are some disadvantages that are starting to become more and more of a problem. We also have some other things that are made much worse by the fact that we didn't design a system with the set of features we have today, so I would like to take this opportunity to rethink a few things related to the extent tree in order to set ourselves up better for the future.

Problems that I want to solve

The extent tree is recursive. Because we track every block, modifying the extent tree creates new references which could potentially create new blocks. In practice there is a steady state, however this makes things like ENOSPC handling very complicated, as we have to reserve a lot of space up front in order to make sure we'll be able to make our reservations.

The extent tree doesn't scale well to multiple modifiers. I wrote patches a year ago to cut down on the amount of places we run delayed refs, because we could get a bunch of threads trying to run delayed refs and they'd all hit lock contention on the root node. We could be better about our locking here, but in the end this is a single tree for the entire file system. Breaking this out into per-block group would make a huge impact in our parallelism.

The csum tree is a huge source of lock contention. If you spin up a workload that bombards the system with writing to many files, you'll see latencies for finish_ordered_extent go up because of the threads getting held up in lock contention on the csum tree. Latencies in finish_ordered_extent can cause latencies on anything that needs to wait for ordered extents, like the ENOSPC flushing code or fsync.

Not anymore for fsync, it's very limited now since it only affects the slow and less common path.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=487781796d302266aff993bee17d4e1ddd73445b

cmurf commented 3 years ago

What's the likely conversion method from v1 to v2? Would it be more or less complex than free space cache v1 to v2? Right now we see a wide range of mount time for initial v2 creation, from seconds to many hours. I'm wondering about the scalability of extent tree conversion, but maybe it's too soon to tell.

josefbacik commented 3 years ago

@cmurf it will be an offline conversion tool, the changes are too complicated to do online on mount like what we did with the free space tree.

josefbacik commented 3 years ago

I've edited my original comment to indicate what the plan is now after thinking about it for a month and starting the work. The hilights are

No more stripe tree. This is because what I want is a little different from what @morbidrsa wants and shoehorning them together is likely to cause more pain than it is to make our lives easier.
The mechanics of the relocation change is slightly reworked because of the above change. The general plan is still the same, but how it works will be slightly different.
The qgroup changes are scrapped. I can accomplish the same behavior without on disk changes, and we still have to do backref lookups for N -> 1 changes on shared extents and there's no real good way to eliminate those.
Added the block group root tree to the original comment.

Conan-Kudo commented 3 years ago

No more stripe tree. This is because what I want is a little different from what @lorddoskias wants and shoehorning them together is likely to cause more pain than it is to make our lives easier.

Who is @lorddoskias?

josefbacik commented 3 years ago

No more stripe tree. This is because what I want is a little different from what @lorddoskias wants and shoehorning them together is likely to cause more pain than it is to make our lives easier.

Who is @lorddoskias?

The wrong fucking person, my bad. I meant @morbidrsa.

josefbacik commented 3 years ago

New snapshot tracking tree and drop trees

One of the main problems I have been trying to solve with extent tree v2 is the explosion of delayed refs for our normal operations. Reworking relocation was a big part of that, tracking all metadata in the extent tree was another, but there's another less well hi-lighted problem, which is how we push down references at shared times.

This is described in my original plan to only track shared metadata blocks, but this still opens us up to a lot of delayed refs for any arbitrary modification. The original COW B-Tree paper had this mechanism where we pushed down references at every COW, and this is how Btrfs has fundamentally operated since the beginning. If we COW block A and it is shared we must add a reference for all of its children before continuing. With a 16k nodesize this can be upwards of 500 references per level. This means for a 3 level tree (the size of the fs tree for a clean fedora install) we'll push 1000 references for every path the first time through after a snapshot.

This tends to not be too big of a deal in practice because this is mostly an up-front cost. First modification sucks, every subsequent modification sucks a little less. However the larger the tree the more suckage and the longer the suckage lasts.

This is the other fundamental problem with ENOSPC tracking in btrfs. Any particular modification is potentially unbounded. We could certainly worst case reserve 500 items worth of reservation for every level in a tree, but that would be overkill most of the time and put a lot of stress on the ENOSPC infrastructure.

Instead we want to bound the modifications to the items we're modifying. If we could push one item per level that would be easy to account for, we have a maximum of 8 delayed refs for every search. This is easy to account for.

Enter the snapshot tree and drop trees. Instead of tracking references per metadata block, we'll keep track of who no longer references blocks per metadata block. We will then use the snapshot tree to determine who theoretically refers to a particular block.

The snapshot tree will have entries like this

struct btrfs_key key = {
    .objectid = root->root_key.objectid,
    .type = BTRFS_SNAPSHOT_ITEM,
    .offset = snap->root_key.objectid,
};

struct btrfs_snapshot_item {
    u64 generation;
    u64 sub_generation;
};

On snapshot we will copy the root like we always do, insert the snapshot item with the source tree objectid and the destination tree objectid, the generation (transid) and a sub_generation, which will change every time it is snapshotted in the same transaction. Using these items we'll be able to build a graph of shared blocks within a tree.

For example, consider subvolume A that has had a snapshot at generation 9 by subvolume B and a snapshot at generation 20 by subvolume C. Our snapshot items will have A->B gen 9, A->C gen 20. For a block owned by A with a generation of 7 we will know that it is potentially pointed at by A, B, and C. This is where the drop tree's come into play. When we cow this block we will add a drop entry for our given volume. The drop tree's will look like this

struct btrfs_key key = {
    .objectid = bytenr,
    .type = BTRFS_DROP_ITEM,
    .offset = generation,
};

struct btrfs_drop_item {
    u64 drop_count;
    u64 sub_generation;
    u64 root_ids[0];
};

// If we run out of space in the drop item we'll use the following keys for the dropped root objectids
struct btrfs_key key = {
    .objectid = bytenr,
    .type = BTRFS_DROP_ROOT,
    .objectid = root_id,
};

The root_ids will just be an inline list of all the object id's that have dropped their reference to the block, and if we run out of space in the item we'll simply use the BTRFS_DROP_ROOT key type to track the other root items that have been dropped. Since we can calculate the shared count for bytenr's easily because of their owner and generation we can free the block once our drop_count == shared_count.

With this we eliminate 2 things

Extent reference lookup at btrfs_search_slot time. Currently if we're shared we're potentially doing a extent tree lookup for every block we hit while doing a search. We no longer need this, if we are shared then we simply add our drop delayed ref and carry on. When the delayed ref runs it'll determine if it needs to do any freeing.
The delayed ref explosion as we COW down a path. We now only add our drop delayed ref at every level if we are shared.

We will use this for data as well. We will track drops of the data extents via the drop tree, and once the drop_count == shared_count we'll remove the extent reference from the extent tree like normal.

The side effect of this will be that we no longer can do backreference lookup. This is actually OK, and makes things like Qgroups insanely cheap. We use backreference lookups to determine what roots point at specific blocks. Now we can do that with 0 tree lookups, we can calculate it from in-memory cache of the snapshot dependency graph. If you want to know what current roots point at an arbitrary block you can simply take the list of theoretical references and subtract any objectids from the drop tree and have the current list of references. Qgroups becomes much less search intensive and more scalable to many snapshots.

The sub_generation is a thing we need to handle modifying source tree's in the same transaction that they were snapshotted in. Because we use the generation number to determine our potential sharing, this becomes problematic if we COW during the transaction that we were snapshotted. We solve this with a sub_generation, so every snapshot of a subvolume will increase their sub_generation for that transaction, and we will use that to determine which blocks are actually shared by which snapshot. For example, say you have subvolume A and it is snapshotted in the same transaction by both B and C. Part of the snapshot operation is adding a link to the new snapshot, which will modify blocks in subvolume A. However they will have the same generation as the snapshot generation, as we will not have committed the transaction yet. These blocks will be marked with a header flag BTRFS_HEADER_FLAG_SUBTRANSID or something like this and have an entry with a sub_transid. The snapshot for B will have generation == transid, sub_transid == 0. Then when we cow a block to add the link it'll get generation == transid, sub_transid == 1, so we know that this block is not shared by B. Then when we create the snapshot C it'll get a generation == transid, sub_transid == 1, and we know that block is now shared by A and C. When we cow down A to insert the entry for C we'll mark those blocks with BTRFS_HEADER_FLAG_SUBTRANSID and insert a drop item with a sub_transid == 2, so we know it is not shared by either B or C.

The sub_generation is a little tricky, but is needed to allow us to COW blocks of a snapshotted root in the same transaction that it was snapshotted. This will allow us to loosen the restrictions on snapshotting slightly, and perhaps allow us to create snapshots even faster than we currently do.

josefbacik commented 3 years ago

I made a series of videos to explain all of the changes so far in case reading this is confusing. Unfortunately I recorded two of them before I realized my cursor doesn't show up, so sometimes I do things like say "This block" and you have no idea what I'm pointing at. Sorry about that. Happy to make more videos to clear up other confusing aspects of the extent tree v2 changes.

https://www.youtube.com/watch?v=TH1Ho19dqP0 https://www.youtube.com/watch?v=wIevirZtF8k https://www.youtube.com/watch?v=_7RAGUUBaPI

Artefact2 commented 3 years ago

Regarding the remap tree, aren't you concerned about slowing down all the reads because of the extra logical->physical lookup?

josefbacik commented 3 years ago

Regarding the remap tree, aren't you concerned about slowing down all the reads because of the extra logical->physical lookup?

Yeah for sure, but it's going there's a bunch of upsides that will more than pay for the cost. The worst case is

Lookup the mapping for the range we're reading (because writes will no longer go into this range).
Split the BIO for ranges that are not contiguous. The remap tree allows us to split all the way down to sectorsize, so before we weren't allowed to break up contiguous extents, now we can. The drawback is that suddenly your old read BIO that would be 1 BIO will now be split up into size/sectorsize BIO's. This is the worst case clearly, but a possibility.

Now for what we gain.

We can split extents into arbitrary sizes. Yes this is also a con, because now we have to split, but it allows us to pack in extents into sparse block groups to get better space utilization. This is nice for things like zoned block groups that really want to not have sparsely allocated block groups.
Waaaaaaay less complicated code. I'm the only person who understands relocation, and only barely.
No more delayed ref explosion. Zygo can make us get 30 second commit times because of how the current relocation works with delayed refs. The only way to fix this problem as it exists currently is with a disk format change to save where we are in the reloc merging step. So if I'm going to need to change the disk format, might as well make it suck less.
We don't have to know anything about the actual data we're copying. With encryption we use the logical address as the IV (at least how it's implemented in patches currently I think), so to relocate encrypted blocks we'd have to decrypt and re-encrypt as we relocate extents. This allows us to not care, we just copy it around and the logical address will stay the same.
The way relocation currently works is to create snapshots of every subvolume in the fs as we move things around. This explodes out the extent tree and explodes out the metadata usage. It makes qgroups more complicated, and it makes backref lookup take longer. With the remap tree the only metadata modification that happens is adding the remap objects, everything else stays the same. The extra overhead of having more snapshots with the current scheme likely costs as much if not more than the cost of the remap tree.
Again the current relocation walks the entire extent tree and copies each extent to a new location and then goes back and updates all the metadata to point at the new location. The remap tree will simply copy everything and add remap objects, which is significantly faster and less error prone than the current scheme.

This is a file system, there's always a cost, but the perf cost is likely paid for by no longer incurring the perf penalty of unsharing blocks in the old scheme. Even if it didn't, all the other upsides are definitely worth the cost.

Zygo commented 3 years ago

I was meaning to comment more on this earlier, but haven't had the time to really dig into it.

Lookup the mapping for the range we're reading (because writes will no longer go into this range).

It's the same average cost as looking up an (inode, offset) pair while reading a file through a subvol, and looking up the bytenr for csums. Both are O(log(n)) lookups with roughly the same item size. For a given file, the key density for (inode, offset) pairs, logical addresses, and physical addresses are different, so some reads will have negligible extra costs (when the logical addresses are contiguous, the next csum/reloc item will be in memory 99.6..99.8% of the time), while other reads will have much higher worst-case costs (when the logical addresses are heavily fragmented, the next extent in the file will require hitting the disk for O(log(n)) csum/reloc tree pages) but again this extra cost is roughly the same as the csum lookup cost, so however bad it gets, its upper bound is a doubling of the cost that is there already for datasum files.

It is an extra cost for read-heavy workloads, but right now write-heavy workloads stop the filesystem more or less dead for more or less unbounded periods of time.

I do have a few concerns about the "relocate everything" cases, where every extent in the filesystem ends up in the remap tree. ZBD devices are AIUI balancing block groups every time they want to delete anything, and adding or upgrading disks in a RAID array requires relocating significant chunks (up to 100%) of the filesystem. If there is nothing putting pressure on the remap tree to shrink, and logical addresses aren't modified, they're going to get more and more fragmented over time, the remap tree will keep getting bigger, and logically sequential addresses are going to point to more random physical addresses. Maybe some background thread could come along later, rewrite the logical addresses, and reduce the remap tree size (either deleting or coalescing nodes)? It could be very lazy, to the point of not doing any work at all if the disk is never idle. This would require figuring out how to do relocation in tiny increments that don't blow up delayed refs, but presumably it's easier to do that after the reloc and snapshot trees are in place.

Does it make sense to make the remap tree separate from the extent tree? If one common case is "remap tree is empty" and another common case is "remap tree contains every extent", we could put the physical location remap items in the extent tree. I think I can talk myself out of this idea though--we'd have to wade through backref items to get to the reloc items, and maybe that means the tree is always full of 50% of data we don't want for the particular operation we are doing (i.e. you'll often want either the backrefs or the physical locations, not both), which would make separate trees better.

Zygo can make us get 30 second commit times because of how the current relocation works with delayed refs

Heh, I'd love to have 30-second commit times. While balance is running, 40% of the machine time is spent in commits running longer than 60 seconds each. I have about 30 events per machine-day where commits take 2-6 minutes. Add another order of magnitude for putting metadata on spinning rust.

The commit times are now the biggest single barrier to rolling out btrfs for new use cases. Crashes and hangs are gone, we can live without raid5, raid1 life cycle is crude but tolerable, but a lot of applications assume the server has crashed if a write takes more than 2 minutes. Can't run any of those on btrfs yet, because btrfs doesn't have bounded latency.

Zygo commented 3 years ago

It's the same average cost as looking up an (inode, offset) pair while reading a file through a subvol, and looking up the bytenr for csums. Both are O(log(n)) lookups with roughly the same item size. For a given file, the key density for (inode, offset) pairs, logical addresses, and physical addresses are different, so some reads will have negligible extra costs (when the logical addresses are contiguous, the next csum/reloc item will be in memory 99.6..99.8% of the time), while other reads will have much higher worst-case costs (when the logical addresses are heavily fragmented, the next extent in the file will require hitting the disk for O(log(n)) csum/reloc tree pages) but again this extra cost is roughly the same as the csum lookup cost, so however bad it gets, its upper bound is a doubling of the cost that is there already for datasum files.

...which reminds me (and this is straying a bit from the topic): delayed refs processing on big, old, spinning filesystems is something like 40% deleting extent items and csums, most of that time waiting to read random metadata pages because no two logically adjacent blocks are physically anywhere near each other after a filesystem has aged a while. I wonder if there are gains to be had by sorting those, or flushing all the delete operations out to a tree, so we get the commit over with quickly and unblock applications. A background thread can clean up and fully delete stuff after the transaction commits.

josefbacik commented 3 years ago

Does it make sense to make the remap tree separate from the extent tree? If one common case is "remap tree is empty" and another common case is "remap tree contains every extent", we could put the physical location remap items in the extent tree. I think I can talk myself out of this idea though--we'd have to wade through backref items to get to the reloc items, and maybe that means the tree is always full of 50% of data we don't want for the particular operation we are doing (i.e. you'll often want either the backrefs or the physical locations, not both), which would make separate trees better.

Well we no longer have extent trees for non-data block groups. The other thing is we have to stuff the remap tree in the system chunks in order to make sure we can read them before we start trying to read in the block groups.

As for the "relocate everything" case, yes if you don't change anything it kind of sucks, but as block groups are free'd their remap items are deleted. And then for multiple remaps' we simply update the new location for every remap'ed item.

For example, we remap the block group 1gib-2gib. Say there's a single extent, 1gib-1.5gib, we insert a remap item for the new location, but we also add a remap backref with the new location as the key.objectid. That way if we remap the block group that has the new location for another block group, we simply go and update the existing remap items with the new locations, so we don't have multiple steps of indirection, we only just have one.

Finally we have to remember that the remap items only refer to allocated area in the block group. This doesn't refer to actual extents. So if we have 500 1mib extents in that 1gib-1.5gib range, we're only going to get 1 remap item for that whole range. You're only going to get a remap item for whatever you actually relocate. It can certainly be more than the number of extents, but it can be less as well.

Additionally in the worst case we're talking a max of ~6mib to describe all the remapped items in the block group, and we only need 3mib of that in memory. The plan is to load this entire tree into a mapping at mount time, so the remap operation is purely CPU/memory overhead, no going to disk to look up our new location.

josefbacik commented 3 years ago

...which reminds me (and this is straying a bit from the topic): delayed refs processing on big, old, spinning filesystems is something like 40% deleting extent items and csums, most of that time waiting to read random metadata pages because no two logically adjacent blocks are physically anywhere near each other after a filesystem has aged a while. I wonder if there are gains to be had by sorting those, or flushing all the delete operations out to a tree, so we get the commit over with quickly and unblock applications. A background thread can clean up and fully delete stuff after the transaction commits.

Keep in mind that we're no longer tracking metadata references in the extent tree, so the only "deletions" for metadata blocks will be adding drop items for any shared metadata blocks, so significantly less delayed metadata operations per transaction, which will help everybody.

For checksums we could delay checksum removal, tho it makes the ENOSPC tracking annoying because we'd have to transfer the ENOSPC reservation from the delayed ref to the actor on the csum tree. But that's the only annoying part, we could easily just use an extent_io_tree to mark csums that need to be deleted and post-process that as a last step in the delayed ref running.

Zygo commented 3 years ago

So if we have 500 1mib extents in that 1gib-1.5gib range, we're only going to get 1 remap item for that whole range.

OK so that's different from how ZBD/ZNS seem to use relocation. They want to do garbage collection. So if half of 500 extents are deleted and garbage collected, they'd be looking at say 125-249 reloc items? 125 for packing half of the remaining extents into the spaces between the other half; 249 for squashing all the extents up against the first one. Or a simpler example with 10 extents:

   ABCDEFGHIJ  <- original extents
   A.C.E.G.I.  <- remaining after deletion, "." is free space gaps in the BG
   ACEGI.....  <- garbage collect by copying to new BG in order
                 (pack all extents except the first one together, 5 reloc items added to tree)
   AICGE.....  <- garbage collect by packing in free space
                 (not really sane except to reduce reloc tree size, 3 reloc items added to tree)

Additionally in the worst case we're talking a max of ~6mib to describe all the remapped items in the block group, and we only need 3mib of that in memory

OK, so 188GB in memory for a 50TB filesystem...hope that's pageable? :'O

I mean it obviously works in simple relocation cases (like you have 5 raid devices and you want 6) where you just want to pick up a block group and drop it somewhere else without changing anything about its structure, even the free space inside it. The garbage-collection use cases don't seem to fit in that model, though. They seem better off with the traditional relocation, or maybe an enhanced version of defrag that coalesces the physical extents while moving them.

Zygo commented 3 years ago

we could easily just use an extent_io_tree to mark csums that need to be deleted and post-process that as a last step in the delayed ref running.

I was thinking of pushing them all the way out of the transaction. Have something like log tree on disk, called delete tree, and all it does is record which extents and csums have hit zero references and need to be removed so we can get the transaction over with. Then btrfs-cleaner or similar comes along and does the actual removal (i.e. go fetch the extent and csum tree pages, remove items from them, repeat) later on, without blocking anyone because it can simply stop consuming extents from the delete tree at any time. Maybe if the filesystem starts getting really full we go back to deleting csums in-transaction to get some space freed up, but if used space is trending downward (as we'd expect it to be if we have a huge pile of csums queued up) then there's no particular rush.

As I write this, I open a ssh window on a host, run watch -n1 cat /proc/$(pidof btrfs-transaction)/stack, and see __btrfs_free_extent or btrfs_del_csums on the stack trace nearly continuously, while all writers on the filesystem are blocked. It makes deleting things so painful that we sometimes have to schedule deletes in big overnight batches.

josefbacik commented 3 years ago

I'm going to set the csum thing aside for now and focus on the remap tree for now.

You make a good point for the 50tb case. This wouldn't work out exactly like that, since you're filling block groups as you relocate them, and if you write 50tb and then delete every other 4k block you get what you deserve, but it certainly hi-lights the drawbacks of this solution.

Lets talk about what we want relocation to do then. It's essentially garbage collection. In it's current state it uses the extent tree as it exists with all of its backreferences to find all of the blocks in a block group, copy them somewhere else, and then go back and update all of the pointers to point at the new location.

In its current state, that means if you have 1000 snapshots that share a block, it must update 1000 paths to that block to point at the newly relocated block. This currently has to be done in a single transaction, and because we're doing the btrfs_inc_ref() as we walk down we're talking anywhere from 500k to 4million delayed refs in a single transaction. This is where your 45 minute transaction commit shows up. This also means we only continue to share this one block and every block underneath it. If it is a leaf then we've just added 8k new blocks so we could "share" this one relocated block. For every level up we're saving 1000 blocks, but it's still a pretty large space explosion.

Can I make the current scheme less shitty? Sure, I can add stuff to track where I am in the merge process. This allows us to ratelimit the amount of delayed refs we generate per transaction. However relocation still is a function of how many paths to a shared block we have to update. So the relocation itself is still going to take all eternity, and take longer for every snapshot we have.

Can I make it better WRT the number of snapshots? Maybe. I could do the inverse of what relocation does today. Instead of walking down the reloc tree and then relinking a path for a snapshot to the new block, I could simply build a tree that matches the original snapshot and then just swap the root to point at the reloc root. This only really works if the snapshot doesn't change tho. So it helps my stupid case of 1000 snapshots because likely only 1 root is being modified and the rest are staying static, but in reality this isn't how things work, every root is going to have slightly updated stuff, so keeping state and swapping whole trees is super fragile and likely to break in a horrifying way.

Finally snapshot aware defrag is a similar problem, just limited to data. It is bound by the number of referencing trees, and we're exploding usage for every operation to update the tree.

And all of these still operate with the original extent tree. With extent tree v2 it is impossible to figure out who owns discrete blocks simply from the bytenr. If we read the actual block we can determine who references us, which means we have to walk all trees looking for a block in a particular block group. If we wanted to make old fashioned relocation still work I wouldn't be able to change the extent tree like I want to.

Which brings us to why do we want relocation in the first place. Well because we divide up metadata and data allocations into specific block group types. What if we stopped doing that? Well we could certainly do away with block groups completely, simply have a device tree that matches stripes, and now we can only operate on entire devices instead of individual chunks. This is not as horrible as it sounds, in the end all we really want to accomplish is having blocks striped in the way we configure, so the extra granularity just adds flexibility but headaches because we can't mix things.

But we have zoned support, which does actually have a certain granularity that it wants to operate on, the zone size. So we can't just remove the notion of block groups, we have to keep them around. And we need to be able to garbage collect, so we have to keep some sort of GC method around.

Snapshot aware defrag is hugely problematic given how btrfs is designed. Since all of the metadata is in one tree we can actually end up with a huge amount of unsharing between snapshots with just basic things like atime updates. This means that we can't easily whole-hog replace extent mappings for an inode, because we could have modified unrelated metadata that shares paths with the extent items and thus prevents us from doing easy things like swapping in whole subtrees into a snapshot.

We could do something like per-inode extent trees. This would be good for large files with lots of extents, but you don't want to waste a 16kib leaf for every inode on the file system, so you still have to contend with the unsharing of lots of snapshots.

The real thing that hurts us here is concurrent modifications of the trees. If we're unsharing random things we have no choice but to unshare loads of paths to splice in new locations. We could solve this by disabling concurrent modifications of file system tree's while we move shit around. This would certainly work nicely from a complexity standpoint, because then we can simply move shit around for one tree, and then update all the other trees to point at the new paths. In our 1000 snapshots thing this means we're updating 1008 blocks, which is pretty nice. What's not nice is nobody can write to any of these snapshots while we're doing relocation. This is like fs-freeze, only worse because the time we're frozen is bound by the number of snapshots we have.

Which brings us back to the mapping tree. It's even worse than you hilight, because again we are forced to stick these in the system chunks because we have to bootstrap their location via the superblock at mount time. So we can't actually store 188gib of remap times, we can store as much as system chunks that we can have, which is in the worst case 14, so 14gib total for storing the block group and remap tree. Which even with no relocation suddenly makes our maximum FS size around 390 petabytes instead of 8 exabytes. With remap things we'll cut down on the amount of block group items we can have, and we could make it so you weren't allowed to have more than a couple of TiB (in the worst case of course).

I could split the block group and remap trees into per-type trees. Then only stick metadata block group and metadata remap trees in the system chunks and then the data block groups and data remap trees in the metadata areas. Now we're only limited to 393 pb of metadata, or a few TiB of metadata in the worst case. This is an easier pill to swallow I think, but still kind of awkward.

Another thing I could do would be to limit the size of the remap items. Instead of letting us remap sectorsize granularity, we limit ourselves to a specific chunk size, say 8mib. This turns our worst case remapping into 6kib, so our theoretical 50tib fs only needs 340mib to describe the entire file system. This wouldn't allow us to compress down as nicely, and in fact wouldn't let us relocate this 50tib fs because we wouldn't be able to find 8mib chunks, but this may be an acceptable trade.

This is a long post to say that I've thought long and hard about this problem. Relocation just sucks, there's no good approach that works well.

The remap tree can be made to be less awful in the worst case, and allows me to still make all these other changes that I want to make. With my current plan I eliminate unbounded modifications and delayed ref generation. If I were to leave relocation in place like it is I would have to leave the extent tree like it is, which means leaving ref count pushing like it is, which leaves us in the current state WRT delayed ref updates for shared blocks. I think making our garbage collection simpler but more coarse is a fine price to pay for what we're going to get in return.

josefbacik commented 3 years ago

My original plan was to stuff the remap tree and block group tree in the system chunks, and then simply cow any blocks in a system chunk for any relocation of the system chunk instead of doing the remap thing for system chunks.

Clearly this has all the drawbacks I listed above, we're limited on system chunks.

However I still need this property, so I'm going to add another chunk type that will hold the remap and block group tree, and give it the same properties. This way we can have as large of a remap tree as we want and not have to worry about it. It'll be

system chunks - chunk tree.
mapping chunks - remap and block group trees.
metadata chunks - everything else.

Zygo commented 3 years ago

With extent tree v2 it is impossible to figure out who owns discrete blocks simply from the bytenr.

That has some disturbing implications for dedupe, but at the same time maybe there's a better way to dedupe?

Dedupe is in a sense like a single-extent relocation, except that instead of copying data from point A to point B, we claim credit for a copy someone else previously did, update all the references to A so they are now shared references to B, then free the space at A. Right now dedupe metadata updates are even worse than the snapshot relocation case. Snapshots can share some interior nodes along the paths between extents and roots, while dedupe has to permanently unshare all nodes along every path as well. It's possible to optimize that a little since all the shared leaf nodes get exactly the same modification--but the best case in terms of iops and space consumed is still equivalent to "one-extent balance."

Ideally, we do dedupe by reading data blocks the same way scrub does (now), which gives us some important benefits:

we can read the csums out of the csum tree instead of reading the data blocks most of the time, and read data blocks sequentially when we have to (e.g. compressed or nodatacow). This can be fast enough that we don't have to mess around with incremental scanning (i.e. run it on the whole filesystem every day, or break the filesystem into 30 sections of 1000 block groups each and dedupe one section every day, for a full dedupe every month).
we never touch any data block more than once if it's unique. This is very important when the filesystem has both snapshots and reflinks to multiply the number of references for popular extents into the millions.

When we find a duplicate data block, we have to figure out what files the block belongs to, so we can update data pointers. We also have to know which of two duplicate extents to keep, e.g. we want to keep the one with more existing references, or the one with the longest run of contiguous duplicate blocks, or whatever metric we're using for extent cost model. Both of those require a bytenr to list_of_forward_refs map. btrfs currently provides LOGICAL_INO, so btrfs can run dedupe tools like bees which work from the data blocks upward (as opposed to duperemove and dduper, which work from the files downward).

Remap tree (assuming all scaling issues are resolved) could make dedupe possible without touching subvols at all. A dedupe agent can find identical data blocks and remap them without knowing or caring which files they belong to. The gotcha is that AFAICT remap tree doesn't handle overlapping/shared references at all, and even if it did, it's not scalable to millions of duplicate files (or maybe it is, but its performance is equivalent to csum tree?).

If the filesystem doesn't provide a reverse map (like LOGICAL_INO_V2 or GETFSMAP from xfs/ext4) or a magic physical remapping layer that allows a deduper to ignore files completely, then the deduper has to build a reverse map by walking all the trees itself. Even if the filesystem provides something like TREE_SEARCH_V2 to efficiently rip the subvol trees, the reverse map is still a pretty huge pile of metadata to maintain in userspace. On the other hand, "not maintaining a huge pile of metadata in the filesystem any more" is an explicit goal of extent tree v2, so it might be fair to push that work back onto userspace.

To be clear, I hate it because I'm going to have to do a lot of work if I want bees to work with extent tree v2, not because extent tree v2 is not a good idea for btrfs.

It's probably possible to build a logical->(inode,offset) map and keep it up to date incrementally with tolerable effort (TREE_SEARCH_V2 subvol scans filtered by transid ranges to put data in, monitor block group and subvol lists for deletions to take data out) and error rate (we can tolerate a lot of inaccuracy because we assume we're losing races against user writes anyway, and let the dedupe ioctl filter out stale metadata). I was considering this approach as a way to get bees running on filesystems that have neither GETFSMAP nor LOGICAL_INO. This is of course assuming that extent tree v2 doesn't horribly break TREE_SEARCH_V2 as well...

josefbacik commented 3 years ago

Ooops sorry @Zygo, it'll no longer be possible to say "who owns this METADATA block", we'll still track all the same extent information per inode. So all of the normal stuff we can do will work, and with the snapshot tree we'll be able to very cheaply map a shared extent to all of the referencing subvolumes. The only thing we're losing with extent tree v2 is being able to easily pinpoint which metadata blocks are allocated where, and who owns them. We'll have to do that by walking trees.

Zygo commented 3 years ago

who owns this METADATA block

Oh OK that's much better. Though I did get a few good ideas from the exercise of figuring out "how could bees work without LOGICAL_INO?"

It would be really handy if there was an ioctl for dedupe that could change data pointers in shared metadata blocks without unsharing them (or changing length since that might require inserting new items). I had been thinking of doing something along these lines for extent tree v1 (where we can use the backreferences and walk upward from the data extent) but never found the time to try it (all the time I did find got burned trying to understand how balance does what it does).

If we need a top-down tree walk to work, bees could provide a map of millions of duplicate (bytenr, length, generation) pairs at a time to amortize the cost (the generation is there to detect modifications--if the extent at bytenr doesn't match by the time tree walk gets to it, then dedupe is skipped). It would basically be a two-phase algorithm: phase one uses memory on a hash table to discover duplicate extents, phase two loads the duplicate extent map into memory where the hash table was, and walks the file tree again, replacing any extent on the list as it goes. Two-phase dedupe algorithms suck, but it would be worth it if we can keep shared metadata pages shared.

easily pinpoint which metadata blocks are allocated where, and who owns them. We'll have to do that by walking trees.

That might still be a challenge for ZBD/ZNS garbage collection, but metadata is a much simpler case: all blocks are the same size, there's many fewer of them than data, and there's a steady stream of new ones. Garbage collection could lazily insert new metadata writes between old metadata block copies, and it would only need one remap item to cover the beginning of the block group to the write pointer. Or something like that--ZBD/ZNS isn't really my bailiwick.

josefbacik commented 3 years ago

Some notes from today's btrfs developer meeting.

per-bg roots may be too many roots, especially as the file system grows. We could go for larger block groups, but for things like zoned support that isn't necessarily an option. Instead the plan is to allow you to specify a number of "chunks" to split the file system into for scalability. We'll then simply hash the bytenr to figure out which csum/extent/free space tree to use. This gives us the scalability that we want without having 3k new roots for a 1tib file system.
@fdmanana wanted a log_version added to the sb so he can change the log type without messing with incompat bits.
Nikolay is going to benchmark some different node layout schemes to see if we want to change to something like eytzinger while we're changing things.
We've talked me out of a offline conversion tool. It would be annoying to maintain in the long term and could possibly be a source of other bugs that we would have to address. Re-creating file systems sucks, but so does corrupting them with an offline tool in a way that's unrecoverable.

Conan-Kudo commented 3 years ago

We've talked me out of a offline conversion tool. It would be annoying to maintain in the long term and could possibly be a source of other bugs that we would have to address. Re-creating file systems sucks, but so does corrupting them with an offline tool in a way that's unrecoverable.

So there would be no conversion at all? That would mean we could never get rid of extent tree v1, right?

kdave commented 3 years ago

I don't want to write off the conversion tool idea yet. We have the infrastructure in place in convert and even with the advanced features like extent sharing and whatnot, it's still doable in the same way - writing new metadata into the holes and switch superblock at the end. In some sense I think this would be even easier as both versions would use the same code for modifying trees, the difference is in the logical layer of the extent tree.

kdave commented 3 years ago

I have a requirement that would allow a something not possible with v1. We've discussed that elsewhere, so I'll repeat it here too.

Subvolume groups. Now we have one global space where the extents can be shared, either by reflink or as a snapshot. The idea is to add more such extent sharing domains. The toplevel one would be always there. Extent sharing cannot cross the domain, which is the only visible difference to what we have now.

A followup feature is encryption, that works on the extent sharing domain and allowing to do things in a different in each group. Things like assigning different keys, full/partial/no metadata encryption.

Based on the discussion, this should be possible with v2.

josefbacik commented 3 years ago

I typed a whole update and then closed the tab, I hate my life. Here's an update of the problems we've been encountering with the extent tree v2 design

https://www.youtube.com/watch?v=BT-FwntK7RI

tl;dr We are adding a garbage collection root to handle some of the more complicated free'ing edge cases. This is to reduce the amount of operations that are required to occur before a transaction is allowed to commit. This is all Chris's idea, blame him when it goes poorly.

Conan-Kudo commented 3 years ago

A followup feature is encryption, that works on the extent sharing domain and allowing to do things in a different in each group. Things like assigning different keys, full/partial/no metadata encryption.

This would be extremely appealing for Fedora Linux, as we're actively trying to come up with a strategy for seamless full disk encryption that doesn't require reformatting and reinstalling. If v2 gives us the flexibility to pull this off with strong encryption all the way down to metadata, then this would be utterly fantastic. Of course, this would also include being able to do "blind" send/receive (that is, send/receive without decrypting the volume or the data for the send stream).

kdave commented 3 years ago

Of course, this would also include being able to do "blind" send/receive (that is, send/receive without decrypting the volume or the data for the send stream).

Yeah, that's independent and covered by Omar's patches.

Conan-Kudo commented 3 years ago

Of course, this would also include being able to do "blind" send/receive (that is, send/receive without decrypting the volume or the data for the send stream).

Yeah, that's independent and covered by Omar's patches.

Which ones are those? I haven't seen anything land recently for this...

josefbacik commented 3 years ago

Extent tree v2 changes

Per-bg roots are now simply N global roots

The original plan was to allocate a new csum, extent, and drop root for every block group. This would make lock contention better, but at the cost at really exploding the amount of roots we would potentially have to update per transaction commit. We could solve this partly by increasing the default BG size, but this isn’t possibly for zoned block groups.

Instead we will pre-allocate N number of roots at mkfs time. This can be user tunable, but will basically pick NR_CPUS as a default. Block groups will then point at their set of csum, extent, and drop roots on creation. So if we have 4 CPU’s, we’ll have 4 csum roots, 4 extent roots, and 4 drop roots, and then block groups will point at one of these 4 ID’s for their set of global roots.

This solves the lock contention problem without drastically exploding the number of roots that could be modified during a transaction. This also allows us to in the future have global roots for other things. For example if we wanted to have per-subvolume csum roots we could potentially do this by adding a field to the root to point at their own personal global root id set.

We need to keep bookended extents

Unfortunately my scheme to do overlapped extents and free space as required will have to go. The overall plan works fine, but if you through qgroups into the mix it suddenly becomes a nightmare. At that point we can’t really tell if a data extent is shared or exclusive, because there could be overlapping ranges, so we would have to go to the extent tree to check if there were overlapping ranges at every given point. We will have to keep the current scheme as it is to make qgroups not suck.

A possible future option is to post-process these extents in order to free up space, but that’ll be outside the scope of this change.

Open questions

The snapshot+drop root tree is a nice solution for tracking changes, because it only requires 1 tree modification per level of a shared tree, and it allows us to easily figure out what is shared where, sort of.

Drop tree pros

Easy to work out possible root references.
Single tree update when un-sharing individual blocks.
Easy to determine who is still referencing a specific block with a single lookup in the normal case.
Drop tree cons
Snapshot chaining makes things complicated. This is having subvolume A which is snapshotted by B, and then B is subsequently snapshotted by C. In this case we would think that a block in A is referenced by C, but if that block was COW’ed by B before C was created then this isn’t the case. To figure this out we would need to track the generation of every drop entry so we could determine if C actually referenced the block. This is more space expensive, and more computationally expensive.
The worst case is very space expensive. Consider 1 million snapshots of A, and we rm -rf all 1 million snapshots of A. We now have every block that was referenced in A with 1 million drop entries each. If A had 100 blocks, that’s 100 million entries, and at 16 bytes (root id and generation) that’s 1.6gib of space usage in the worst case.
Modifying the original subvolume during the same transaction as the snapshot is very tricky. We have to implement a sub-transid mechanism to figure out if a block with generation == snapshot generation is actually shared or was added after the snapshot was taken. This isn’t overly complicated in practice, but is annoying.

Chris’s proposal is to essentially dump the delayed refs tree to disk. Our most recent compromise was to do both things, to have the snapshot+drop tree and then have a garbage collection tree to handle the “cons” cases above after the fact.

However the con’s for drop tree+snapshot tree are very wonky and don’t really fit well into any particular mental model. You just have to know that these are problems and we have to code around them and hope they don’t hurt us in the long run.

Alternative proposal

We do what Chris suggested early on, and simply push our delayed ref stuff to an on disk tree. This has the upside of keeping our reference counting the way it has always been, and is relatively straightforward to keep track of. It has all the downsides of our current approach, however we can make those downsides less painful.

Part of what makes our current approach so painful is that every delayed ref must be run in that transaction. Because of this we can get bad cases where we’ve completely blown out the transaction commit time because we have to run the delayed refs.

However if we simply put this work into an on-disk tree, we can post-process the extent tree updates at any point, allowing transaction commits to happen at any time.

Furthermore we can take our most expensive operations, btrfs_inc_ref() and btrfs_dec_ref() and turn them into single items in the delayed ref tree. This doesn’t mean we avoid the cost associated with btrfs_inc_ref() and btrfs_dec_ref(), but it does mean we can delay it to the point where we may have less churn on the file system in general. Meaning we will still at some point have to actually do the work of btrfs_inc_ref() and btrfs_dec_ref() in the future, but we could end up cancelling operations out and end up with less actual churn on the extent trees.

Pros

Slightly easier from a code-stand point. This would be way less drastic of a change, because in the end the same sort of operations would still happen like they always have, we will just be delaying them via a new tree.
The worst case space usage is the same as it always has been. In the above example of 1 million snapshots, we’ll only have 100 entries, 1 for every block that was in A. The actual worst case for this is we have 1 block with 1 million entries, and then 99 entries for the remaining blocks, which is closer to ~20mib of usage in the worst case.
We can delay checksum deletion. This was an ask from Zygo, and actually this happens no matter what option we choose because we’re going to need GC for some things anyway with the drop tree+snapshot scheme. We can run the checksum deletion in this async operation and we can go as fast or as slow as we want.
We can delay free space tree updates. Instead of adding an entry every time we mess with an extent, we can essentially mass add free space to the free space tree via the un-pinning mechanism, which would cut down on the free space tree churn. Again this kinda happens with both schemes, but I’m putting it here.
Cons
Qgroups still sucks. We still have to do backref lookups to figure out who points at us and see if anything changed. However because we’re delaying all of the running of the delayed refs, we won’t have to pay this cost at transaction commit time.
ENOSPC actually gets pretty annoying with GC, which is why I’ve avoided it like the plague until now. At some point we’ll have to wait for GC to finish to see if we gained any new free space, and this will hurt us in the very full cases where we need to wait for it to finish. This will be paid in both the drop tree+snapshot tree scheme and this scheme, it will just be more pronounced in this case because the operations will be more expensive.

What do I want?

I want all the core btrfs developers to read the above thing and pick a side, or suggest alternatives. At our meeting next week I want us to decide on what we think the best way forward is, and then I’ll go down that path and see if there’s anything we didn’t think about that may change our minds.

lorddoskias commented 3 years ago

Extent tree v2 changes

Per-bg roots are now simply N global roots

The original plan was to allocate a new csum, extent, and drop root for every block group. This would make lock contention better, but at the cost at really exploding the amount of roots we would potentially have to update per transaction commit. We could solve this partly by increasing the default BG size, but this isn’t possibly for zoned block groups.

Instead we will pre-allocate N number of roots at mkfs time. This can be user tunable, but will basically pick NR_CPUS as a default. Block groups will then point at their set of csum, extent, and drop roots on creation. So if we have 4 CPU’s, we’ll have 4 csum roots, 4 extent roots, and 4 drop roots, and then block groups will point at one of these 4 ID’s for their set of global roots.

This is similar to how xfs is doing its internal split into allocation groups. One question I have is how BGs are going to be tied to their respective set of roots - perhaps a new item type would be necessary, whose body would include the 3 block pointers to respective roots? Perhaps giving example of possiible structure for those items would be useful. Do you think it will be useful to tune the global roots count after the fs is created, I envision it can be done for added complexity. I.e if you have a fs on one machine and then upgrade the CPUs would we want to scale existing fs accordingly? This might require balance in order to bring things into, well, balance :)

This solves the lock contention problem without drastically exploding the number of roots that could be modified during a transaction. This also allows us to in the future have global roots for other things. For example if we wanted to have per-subvolume csum roots we could potentially do this by adding a field to the root to point at their own personal global root id set.

Can you expand a bit more on this or it's just "thinking out loud"?

We need to keep bookended extents

Unfortunately my scheme to do overlapped extents and free space as required will have to go. The overall plan works fine, but if you through qgroups into the mix it suddenly becomes a nightmare. At that point we can’t really tell if a data extent is shared or exclusive, because there could be overlapping ranges, so we would have to go to the extent tree to check if there were overlapping ranges at every given point. We will have to keep the current scheme as it is to make qgroups not suck.

A possible future option is to post-process these extents in order to free up space, but that’ll be outside the scope of this change.

Would this post-processing require on-disk format changes if so it might make sense to think about it harder now so as down the line it can be enabled with less pain/format-breaking changes.

Open questions

The snapshot+drop root tree is a nice solution for tracking changes, because it only requires 1 tree modification per level of a shared tree, and it allows us to easily figure out what is shared where, sort of.

Drop tree pros

Easy to work out possible root references.

Single tree update when un-sharing individual blocks.

Easy to determine who is still referencing a specific block with a single lookup in the normal case.

Drop tree cons

Snapshot chaining makes things complicated. This is having subvolume A which is snapshotted by B, and then B is subsequently snapshotted by C. In this case we would think that a block in A is referenced by C, but if that block was COW’ed by B before C was created then this isn’t the case. To figure this out we would need to track the generation of every drop entry so we could determine if C actually referenced the block. This is more space expensive, and more computationally expensive.

The worst case is very space expensive. Consider 1 million snapshots of A, and we rm -rf all 1 million snapshots of A. We now have every block that was referenced in A with 1 million drop entries each. If A had 100 blocks, that’s 100 million entries, and at 16 bytes (root id and generation) that’s 1.6gib of space usage in the worst case.

Modifying the original subvolume during the same transaction as the snapshot is very tricky. We have to implement a sub-transid mechanism to figure out if a block with generation == snapshot generation is actually shared or was added after the snapshot was taken. This isn’t overly complicated in practice, but is annoying.

Chris’s proposal is to essentially dump the delayed refs tree to disk. Our most recent compromise was to do both things, to have the snapshot+drop tree and then have a garbage collection tree to handle the “cons” cases above after the fact.

However the con’s for drop tree+snapshot tree are very wonky and don’t really fit well into any particular mental model. You just have to know that these are problems and we have to code around them and hope they don’t hurt us in the long run.

To me this reads as "this needs to be very explicitly documented so that anyone who does changes to relating code should be well aware of the pitfalls and be able to mitigate them.

Alternative proposal

We do what Chris suggested early on, and simply push our delayed ref stuff to an on disk tree. This has the upside of keeping our reference counting the way it has always been, and is relatively straightforward to keep track of. It has all the downsides of our current approach, however we can make those downsides less painful.

Part of what makes our current approach so painful is that every delayed ref must be run in that transaction. Because of this we can get bad cases where we’ve completely blown out the transaction commit time because we have to run the delayed refs.

However if we simply put this work into an on-disk tree, we can post-process the extent tree updates at any point, allowing transaction commits to happen at any time.

Furthermore we can take our most expensive operations, btrfs_inc_ref() and btrfs_dec_ref() and turn them into single items in the delayed ref tree. This doesn’t mean we avoid the cost associated with btrfs_inc_ref() and btrfs_dec_ref(), but it does mean we can delay it to the point where we may have less churn on the file system in general. Meaning we will still at some point have to actually do the work of btrfs_inc_ref() and btrfs_dec_ref() in the future, but we could end up cancelling operations out and end up with less actual churn on the extent trees.

Doesn't this open a whole new can of worms, because we now need to implement logic to decide when is a good time to process the delayed refs tree and apply its changed to the extent tree. What this delayed refs tree becomes is really another type of log tree, no ? I assume its changed would need to happen within a transaction, so say in T1 (trans 1) we create a bunch of delayed refs + other operations, we write the delref tree on-disk (what drives the decision whether we choose to write the tree as-is rather than process it in this transaction i.e do we base the decision on some threshold ), without applying it to the extent tree. Some time later a T2 is about to be committed - do we keep on appending ot the delref tree or do we run everything in T2.

Another route we can go is perhahps have the cleaner thread open a transaction to reconcile extent tree/delref tree, however if in the mean time we have other operations in progress they weill naturally open a handle to thetransaction running the delref/extent tree reconciliation process, meaning they could possibly bear the cost of previous changes to the fs.

Pros

Slightly easier from a code-stand point. This would be way less drastic of a change, because in the end the same sort of operations would still happen like they always have, we will just be delaying them via a new tree.

The worst case space usage is the same as it always has been. In the above example of 1 million snapshots, we’ll only have 100 entries, 1 for every block that was in A. The actual worst case for this is we have 1 block with 1 million entries, and then 99 entries for the remaining blocks, which is closer to ~20mib of usage in the worst case.

We can delay checksum deletion. This was an ask from Zygo, and actually this happens no matter what option we choose because we’re going to need GC for some things anyway with the drop tree+snapshot scheme. We can run the checksum deletion in this async operation and we can go as fast or as slow as we want.

We can delay free space tree updates. Instead of adding an entry every time we mess with an extent, we can essentially mass add free space to the free space tree via the un-pinning mechanism, which would cut down on the free space tree churn. Again this kinda happens with both schemes, but I’m putting it here.

Cons

Qgroups still sucks. We still have to do backref lookups to figure out who points at us and see if anything changed. However because we’re delaying all of the running of the delayed refs, we won’t have to pay this cost at transaction commit time.

ENOSPC actually gets pretty annoying with GC, which is why I’ve avoided it like the plague until now. At some point we’ll have to wait for GC to finish to see if we gained any new free space, and this will hurt us in the very full cases where we need to wait for it to finish. This will be paid in both the drop tree+snapshot tree scheme and this scheme, it will just be more pronounced in this case because the operations will be more expensive.

What do I want?

I want all the core btrfs developers to read the above thing and pick a side, or suggest alternatives. At our meeting next week I want us to decide on what we think the best way forward is, and then I’ll go down that path and see if there’s anything we didn’t think about that may change our minds.

josefbacik commented 3 years ago

Instead we will pre-allocate N number of roots at mkfs time. This can be user tunable, but will basically pick NR_CPUS as a default. Block groups will then point at their set of csum, extent, and drop roots on creation. So if we have 4 CPU’s, we’ll have 4 csum roots, 4 extent roots, and 4 drop roots, and then block groups will point at one of these 4 ID’s for their set of global roots.

This is similar to how xfs is doing its internal split into allocation groups. One question I have is how BGs are going to be tied to their respective set of roots - perhaps a new item type would be necessary, whose body would include the 3 block pointers to respective roots? Perhaps giving example of possiible structure for those items would be useful. Do you think it will be useful to tune the global roots count after the fs is created, I envision it can be done for added complexity. I.e if you have a fs on one machine and then upgrade the CPUs would we want to scale existing fs accordingly? This might require balance in order to bring things into, well, balance :)

Yes so the idea is that we create N number of global roots at mkfs time, but at any point in the future you can easily add new sets. I'm going to re-use the chunk_objectid to point at the "global root id", since that is always set to BTRFS_FIRST_CHUNK_TREE_OBJECTID currently. So we allocate a block group, we pick whatever "global root id" we want next, and set it in this item. Then any bytenr that falls inside that block group just looks up the block group, gets the "global root id" from the block group, and then uses the appropriate helper, so if we want the extent root we do something like

root = btrfs_extent_root(fs_info, bytenr);

// btrfs_extent_root does
block_group = btrfs_lookup_block_group(fs_info, bytenr);
struct btrfs_key root_key = {
    .objectid = BTRFS_EXTENT_TREE_OBJECTID,
    .type = BTRFS_ROOT_ITEM_KEY,
    .offset = block_group->global_root_id,
};
return btrfs_lookup_root(fs_info, &root_key);

So if after the fact we want to add more global roots, well then that's just more global root id's we can assign when we make block groups. If we want to re-balance everything across more block groups then we can just balance to naturally spread everything across the new block groups.

This solves the lock contention problem without drastically exploding the number of roots that could be modified during a transaction. This also allows us to in the future have global roots for other things. For example if we wanted to have per-subvolume csum roots we could potentially do this by adding a field to the root to point at their own personal global root id set.

Can you expand a bit more on this or it's just "thinking out loud"?

Just thinking out loud, kdave mentioned possibly wanting to do this, having the root id thing makes it easy to make these changes in the future.

We need to keep bookended extents

Unfortunately my scheme to do overlapped extents and free space as required will have to go. The overall plan works fine, but if you through qgroups into the mix it suddenly becomes a nightmare. At that point we can’t really tell if a data extent is shared or exclusive, because there could be overlapping ranges, so we would have to go to the extent tree to check if there were overlapping ranges at every given point. We will have to keep the current scheme as it is to make qgroups not suck. A possible future option is to post-process these extents in order to free up space, but that’ll be outside the scope of this change.

Would this post-processing require on-disk format changes if so it might make sense to think about it harder now so as down the line it can be enabled with less pain/format-breaking changes.

I'm still thinking on this, it's kind of an ugly situation and would require changing the file extent items and the extent items themselves. I don't think we could even accomplish this with postprocessing without exploding metadata usage if there were snapshots, so I don't think this is a realistic option.

Open questions

The snapshot+drop root tree is a nice solution for tracking changes, because it only requires 1 tree modification per level of a shared tree, and it allows us to easily figure out what is shared where, sort of.

Drop tree pros

Easy to work out possible root references.

Single tree update when un-sharing individual blocks.

Easy to determine who is still referencing a specific block with a single lookup in the normal case.

Drop tree cons

Snapshot chaining makes things complicated. This is having subvolume A which is snapshotted by B, and then B is subsequently snapshotted by C. In this case we would think that a block in A is referenced by C, but if that block was COW’ed by B before C was created then this isn’t the case. To figure this out we would need to track the generation of every drop entry so we could determine if C actually referenced the block. This is more space expensive, and more computationally expensive.

The worst case is very space expensive. Consider 1 million snapshots of A, and we rm -rf all 1 million snapshots of A. We now have every block that was referenced in A with 1 million drop entries each. If A had 100 blocks, that’s 100 million entries, and at 16 bytes (root id and generation) that’s 1.6gib of space usage in the worst case.

Modifying the original subvolume during the same transaction as the snapshot is very tricky. We have to implement a sub-transid mechanism to figure out if a block with generation == snapshot generation is actually shared or was added after the snapshot was taken. This isn’t overly complicated in practice, but is annoying.

Chris’s proposal is to essentially dump the delayed refs tree to disk. Our most recent compromise was to do both things, to have the snapshot+drop tree and then have a garbage collection tree to handle the “cons” cases above after the fact. However the con’s for drop tree+snapshot tree are very wonky and don’t really fit well into any particular mental model. You just have to know that these are problems and we have to code around them and hope they don’t hurt us in the long run.

To me this reads as "this needs to be very explicitly documented so that anyone who does changes to relating code should be well aware of the pitfalls and be able to mitigate them.

Of course, this change would require a heavy amount of documentation and commenting everywhere so it's clear what the rules are. This is a con, since it would completely change everything about how btrfs works WRT space tracking, and would take a while for everybody to get used to it.

Alternative proposal

We do what Chris suggested early on, and simply push our delayed ref stuff to an on disk tree. This has the upside of keeping our reference counting the way it has always been, and is relatively straightforward to keep track of. It has all the downsides of our current approach, however we can make those downsides less painful. Part of what makes our current approach so painful is that every delayed ref must be run in that transaction. Because of this we can get bad cases where we’ve completely blown out the transaction commit time because we have to run the delayed refs. However if we simply put this work into an on-disk tree, we can post-process the extent tree updates at any point, allowing transaction commits to happen at any time. Furthermore we can take our most expensive operations, btrfs_inc_ref() and btrfs_dec_ref() and turn them into single items in the delayed ref tree. This doesn’t mean we avoid the cost associated with btrfs_inc_ref() and btrfs_dec_ref(), but it does mean we can delay it to the point where we may have less churn on the file system in general. Meaning we will still at some point have to actually do the work of btrfs_inc_ref() and btrfs_dec_ref() in the future, but we could end up cancelling operations out and end up with less actual churn on the extent trees.

Doesn't this open a whole new can of worms, because we now need to implement logic to decide when is a good time to process the delayed refs tree and apply its changed to the extent tree. What this delayed refs tree becomes is really another type of log tree, no ? I assume its changed would need to happen within a transaction, so say in T1 (trans 1) we create a bunch of delayed refs + other operations, we write the delref tree on-disk (what drives the decision whether we choose to write the tree as-is rather than process it in this transaction i.e do we base the decision on some threshold ), without applying it to the extent tree. Some time later a T2 is about to be committed - do we keep on appending ot the delref tree or do we run everything in T2.

Right, so it is essentially a form of log tree, but because we'll run all references through this tree we don't have any deadline of when we need to apply the changes. We could potentially have 100's of transactions of extent reference changes queued up in here, and we would just apply them in the background at our leisure. The only time we would absolutely need to process everything is at ENOSPC time.

Another route we can go is perhahps have the cleaner thread open a transaction to reconcile extent tree/delref tree, however if in the mean time we have other operations in progress they weill naturally open a handle to thetransaction running the delref/extent tree reconciliation process, meaning they could possibly bear the cost of previous changes to the fs.

That's the point of this change, to make it so there is way less operations that absolutely must be completed in the transaction that the modification occurred in. This has the drawback of echo'ing out into other transactions, so we could be doing work long after the user has stopped writing, but the changes would be discrete and thus we'd incur less latency at transaction commit time as we won't have operations that we must complete before committing the transaction, they just become their own operation inside of a trans handle.

josefbacik commented 3 years ago

I've found that this issue is getting a bit hard to tell what the actual design is. To help facilitate better discussion, I've made a branch of the documentation that I will submit (in its finalized form of course) once the code is merged. This way you can easily see the current state of the design, and then we can continue the discussions here

https://github.com/josefbacik/btrfs-dev-docs/blob/extent-tree-v2/extent-tree-v2.md

Zygo commented 3 years ago

We need to keep bookended extents

Unfortunately my scheme to do overlapped extents and free space as required will have to go. The overall plan works fine, but if you through qgroups into the mix it suddenly becomes a nightmare. At that point we can’t really tell if a data extent is shared or exclusive, because there could be overlapping ranges, so we would have to go to the extent tree to check if there were overlapping ranges at every given point. We will have to keep the current scheme as it is to make qgroups not suck.

A possible future option is to post-process these extents in order to free up space, but that’ll be outside the scope of this change.

Would this post-processing require on-disk format changes if so it might make sense to think about it harder now so as down the line it can be enabled with less pain/format-breaking changes.

I'm still thinking on this, it's kind of an ugly situation and would require changing the file extent items and the extent items themselves. I don't think we could even accomplish this with postprocessing without exploding metadata usage if there were snapshots, so I don't think this is a realistic option.

Removing bookended blocks is possible already. Dedupers on brtfs already have to deal with bookends when an extent has partially unique, partially duplicate data. The entire extent must be destroyed, so the deduper must create a temporary copy of the unique data, and then all the data in the extent is duplicate, so we can dedupe all of it (or, depending on implementer preference, don't dedupe any part of the extent--but definitely not go half way and create a bookended extent).

The same idea can be used to make a garbage-collection tool, though it would obviously be more efficient if we had an ioctl that said "copy this list of blocks into a new extent and map them on top of the old extents" in the kernel instead of doing it all from userspace.

Obviously it's less IO to be able to split up bookended extents in-place without moving their data around; however, it's usually quite desirable to defrag the data at the same time, since the bookending will be leaving tiny holes everywhere otherwise. Might as well kill two birds with one stone and make an integrated autodefrag + garbage collection tool.

Ideally we'd have a way to relocate data in a shared metadata page without making the page unshared (i.e. we figure out when we're making the same change in every snapshot, so we only have to change the roots of the subvol tree, not the leaves of the subvol tree, but we'd still have to change everything referenced by metadata page bytenr in the extent tree). That would make snapshots a lot less space-intensive if we can figure out a way to do it. There'd still be a ton of iops and I don't see a sane way to get rid of that.

josefbacik commented 3 years ago

We need to keep bookended extents

Unfortunately my scheme to do overlapped extents and free space as required will have to go. The overall plan works fine, but if you through qgroups into the mix it suddenly becomes a nightmare. At that point we can’t really tell if a data extent is shared or exclusive, because there could be overlapping ranges, so we would have to go to the extent tree to check if there were overlapping ranges at every given point. We will have to keep the current scheme as it is to make qgroups not suck.

A possible future option is to post-process these extents in order to free up space, but that’ll be outside the scope of this change.

Would this post-processing require on-disk format changes if so it might make sense to think about it harder now so as down the line it can be enabled with less pain/format-breaking changes.

I'm still thinking on this, it's kind of an ugly situation and would require changing the file extent items and the extent items themselves. I don't think we could even accomplish this with postprocessing without exploding metadata usage if there were snapshots, so I don't think this is a realistic option.

Removing bookended blocks is possible already. Dedupers on brtfs already have to deal with bookends when an extent has partially unique, partially duplicate data. The entire extent must be destroyed, so the deduper must create a temporary copy of the unique data, and then all the data in the extent is duplicate, so we can dedupe all of it (or, depending on implementer preference, don't dedupe any part of the extent--but definitely not go half way and create a bookended extent).

The same idea can be used to make a garbage-collection tool, though it would obviously be more efficient if we had an ioctl that said "copy this list of blocks into a new extent and map them on top of the old extents" in the kernel instead of doing it all from userspace.

We could do this with a snapshot aware defrag, the problem is with the current design (and extent tree v2) we simply can't get around the unsharing of a massive amount of metadata. We have 1000 snapshots and we want to defrag a single inode, say everything fits nicely onto one leaf, we're going to unshare 1000 paths to that inode to update the file extents for all paths. That just sucks.

One thing we could possibly do is de-couple the file extents from the fs tree itself. We would want to limit this to large files that span more than one leaf worth of extents, but this would solve the problem nicely. Then we can just fuck with the file extent tree for the inode, and all 1000 inodes just get the new thing via the new extent items without unsharing the entire fs trees.

This wouldn't be great for a variety of other reasons. First we'd need a way to find the file extent root, which means a massive explosion of the number of tree's we'd have to track. Maybe we do something fun like have a global root for file extent roots, but we can't get around the fact that there would be a lot of root items that need to be updated every transaction. We can't stuff the bytenr in the inode, because then we're back at the same problem, we have to update 1000 paths to point at the new root bytenr.

Maybe I'm overly fearful of updating a 1000 inodes file extent root items in a transaction, we could probably add the cost to the trans handle that does the work, but somebody is going to end up paying the cost somewhere. And all this gains us is the ability to cleanup bookend extents and have snapshot aware defrag. These things are definitely big upsides, but for a certain relatively large btrfs user we wouldn't really care, and the latency induced could be a bummer.

Clearly I'm still chewing on the problem, maybe I'll have a better plan in the future, but I think I've bitten a bit much off with the current set of extent-tree-v2 changes, and I'd rather not try to tackle something I don't have a clear path forward on.

josefbacik commented 3 years ago

Also I should note we could probably go ahead and implement snapshot aware defrag to do essentially what relocation does today. So if a user were so inclined to do snapshot-aware defrag, and was willing to eat the cost of the unsharing of the data, then they could get it done with an ioctl. But I'd want to put a giant message that says "btw this could suuuuuuuuuuuck if you have a lot of snapshots" on our tools.

Zygo commented 3 years ago

we're going to unshare 1000 paths to that inode to update the file extents for all paths

Right, but do we have to unshare all pages in all of those paths? Obviously the roots are all different so we need 1000 new pages there (and delete 1000 because they were all unshared to begin with), but if the leaves all end up identical, can we keep the leaves and some interior nodes shared, and only update say 1002 pages (one leaf that stays shared, one interior that stays shared, 1000 roots that weren't shared before and still aren't now)?

But then all the extent items might have a backref pointing to the metadata page. If these extents all use inline extent backrefs then they're OK (as long as the tree lookup still points to the new page), but if they're using shared block backrefs then we have to update them all--worst case a few hundred extent tree items and their parents.

josefbacik commented 3 years ago

we're going to unshare 1000 paths to that inode to update the file extents for all paths

Right, but do we have to unshare all pages in all of those paths? Obviously the roots are all different so we need 1000 new pages there (and delete 1000 because they were all unshared to begin with), but if the leaves all end up identical, can we keep the leaves and some interior nodes shared, and only update say 1002 pages (one leaf that stays shared, one interior that stays shared, 1000 roots that weren't shared before and still aren't now)?

Hmm we could maybe do this, it would be akin to what relocation does, so it wouldn't unshare anything that is currently shared. We may fuck up if a snapshot gets changed in the middle of our operation and we can't link in the shared block anymore, we would either simply not defrag that inode, or have to loop through again to see if there was anybody that still needed to be updated. I'll think about this some more.

Zygo commented 3 years ago

"btw this could suuuuuuuuuuuck if you have a lot of snapshots" on our tools.

bees ends up being a more or less perfect metadata unsharer in practice--dedupe touches just enough of every page in a snapshot to force unsharing on all of them. I am more than a little familiar with the worst case suck, because it's the state my larger snapshot-heavy filesystems are in all the time.

In theory we don't have to unshare anything in the dedupe case, but userspace doesn't have the interfaces needed to avoid it. I've had "figure out how to make dedupe work more like balance" on my TODO list for years. I've also had "figure out how to make defrag and garbage collection that sucks less than the current stone tools" on the same TODO list...

we can't link in the shared block anymore, we would either simply not defrag that inode

My thinking here is that userspace defrag/garbage-collect would say "please relocate [list of extent, offset, length tuples] contiguously", so the kernel would be looping through backrefs to every metadata leaf page referencing the extents. After one iteration of the loop, the kernel might run the whole loop again to see if any new references to the extents popped up (new snapshots, reflink copies, whatever). This wouldn't treat shared or unshared pages differently, so if a page gets unshared in the middle then it's handled the same way as if it was unshared from the beginning.

Zygo commented 3 years ago

I had a couple of other "while we're making big incompatible on-disk changes" requests which I wrote while you were writing "please no more this is too much churn already" ;) but well here they are:

Can we change the format of the csum tree so we can have multiple csum algorithms? Don't need online conversion yet, just need to get rid of quirks that make it unnecessarily hard.
Can we find a byte inside EXTENT_ITEMs to store the compression format? Right now if I'm reading a filesystem for dedupe, I can rule out most unique data by reading the csum tree and the extent tree; however, for short extents I don't know if the csums are compressed or uncompressed data. I have to go fetch one of the extent's EXTENT_DATA refs to see if the csums are compressed or not, and that can be an expensive backref walk. I'm thinking we have a 64-bit 'flags' field in every extent item, and we use only 3 bits of it so far. Even one bit that says "it's not uncompressed" would help, though with 8 bits in the extent flags I don't need anything in the file extent item any more.

Augusto7743 commented 2 years ago

Thanks for implementing extent tree v2. Please the new feature help to avoid problems with partial written on disk when happen OS crash or power off ? Example BTRFS not is mounted and also not is fixed using btrfs check

Opening filesystem to check... parent transid verify failed on 283934720 wanted 12284 found 11849 parent transid verify failed on 283934720 wanted 12284 found 11849 parent transid verify failed on 283934720 wanted 12284 found 11849 Ignoring transid failure ERROR: could not setup extent tree ERROR: cannot open file system

Impossible to average user access the btrfs file system.

kdave commented 2 years ago

This should not happen and is not relevant for extent tree v2, it's either a hardware problem that does not properly flush the data or a bug.

darkbasic commented 2 years ago

I'd like to ask something that it's hopefully a bit more relevant: right now enabling quotas with lots of snapshots hangs the whole system for minutes, is it a problem extent tree v2 is going to alleviate?