I'd like to ask something that's hopefully a bit more relevant: right now enabling quotas with lots of snapshots hangs the whole system for minutes. Is that a problem extent tree v2 is going to alleviate?
Yes, there are a lot of reasons for this, and I'm addressing all of them (hopefully?)
Thanks very much for all the improvements, and not to be demanding, but what is the expected date for a working extent tree v2 release in the kernel? I only ask to understand when to reformat the disk, and also whether the current code is stable enough to use for testing on a disk used only for OS files.
The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time.
It looks like squota won't be the magic bullet we were hoping for. I'm curious to know if extent tree v2 is still being worked on at this point.
Squotas still works in this case, it just creates this awkward situation where the usage will be attributed to the deleted subvolume and you won't be able to see that. This is the example:
/some/subvolume <- 100g of usage on disk, 100g of usage according to squotas
btrfs sub snap /some/subvolume /my/snap <- 100g of usage on disk, 0g of usage according to squotas
btrfs sub delete /some/subvolume
btrfs filesystem usage / <- shows 100g of usage on disk, squotas shows 0 because /some/subvolume was deleted
This is what we're referring to. Because we're not live tracking shared accounting, and we only attribute the usage to the original root, you can end up "losing track" of your actual usage. Now of course you can du -h /my/snap and see that's where the space is, we don't lie about actual usage. But if you list the usage from squotas it's only going to tell you what is actually accounted to the quota for that subvolume.
As for extent tree v2, yes, I'm actively working on it. Given the scope of the project, the design has had to change as I developed it and discovered flaws in certain areas. I hope to be code complete in the next couple of months.
Existing extent tree
The existing extent tree has a lot of big advantages, but also some disadvantages.
Advantages
However, there are some disadvantages that are starting to become more and more of a problem. There are also other things that are made much worse by the fact that the system wasn't designed with the set of features we have today, so I would like to take this opportunity to rethink a few things related to the extent tree in order to set ourselves up better for the future.
Problems that I want to solve
Latencies in finish_ordered_extent go up because of the threads getting held up in lock contention on the csum tree. Latencies in finish_ordered_extent can cause latencies on anything that needs to wait for ordered extents, like the ENOSPC flushing code or fsync.

Uh, so how does changing the extent tree fix all these problems?
Firstly, it doesn't, at least not all of the problems. I'm wrapping all of these changes under the label "extent tree v2" because it's the biggest part, but it's actually going to be a series of on-disk changes. However I do not want to do these things piecemeal because it'll hamper adoption. There needs to be a clear line in the sand where you choose "extent tree v2" and you get all of these disk format changes going forward. This will drastically simplify our testing matrix, as right now we have a lot of "features" that have to be toggled on individually and can thus be mixed and matched in many different combinations. This will be a lot of changes, but it will make our lives easier going forward. My plan is to make the following specific changes:
This all sounds great, are there any downsides?
Scrub
The biggest downside that I can think of right now is scrub. Right now scrub is relatively straightforward: it just marks a block group as read only, finds all the extents in that block group, reads them because you can find the owner easily, and bam, you're done. With the extent tree reference counting no longer including information about cowonly trees and non-shared fs blocks, we'd have to read all of the trees by searching down them. This means that we would have to keep a cache of fs tree blocks that we've read in order to make sure we don't scrub the same trees over and over again. This also makes it tricky because we can no longer just mark a block group read only while we do our work.
This isn't a huge drawback, data would still work the same, and we can simply search the commit roots and skip any blocks that are younger than the generation that we started the scrub with. It will be more memory to track where we've been, but the overall complexity should be relatively minimal.
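A minimal sketch of that skip rule, assuming we record the transaction generation when the scrub starts; btrfs_header_generation() is the existing accessor for a tree block's generation:

```c
/*
 * Sketch only: while walking a tree from its commit root, skip any block
 * whose generation is newer than the generation recorded at scrub start,
 * since it was written after the scrub began and will be covered later.
 */
static bool scrub_should_skip_block(const struct extent_buffer *eb,
				    u64 scrub_start_gen)
{
	return btrfs_header_generation(eb) > scrub_start_gen;
}
```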
Backreferences
We would no longer be able to look up who owns arbitrary metadata blocks. I don't believe this to be a big issue because
The specifics
A block group tree
As indicated in a comment below, our block group items are spread all over the disk, which makes mount times a big problem with really large drives. Fix this by making a tree to hold all the block group items. That's it, it's pretty straightforward, no other real changes here.
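For reference, the block group item itself doesn't need to change, only which tree holds it; this mirrors the existing on-disk definition, keyed (bytenr, BTRFS_BLOCK_GROUP_ITEM_KEY, length) in the new dedicated tree instead of being scattered through the extent tree:

```c
/* Existing on-disk block group item; it would simply live in its own tree. */
struct btrfs_block_group_item {
	__le64 used;		/* bytes used in this block group */
	__le64 chunk_objectid;	/* objectid of the owning chunk */
	__le64 flags;		/* DATA/METADATA/SYSTEM and profile flags */
} __attribute__ ((__packed__));
```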
Per-block group trees
This will be easy enough: we know the logical offset of the bytenr we care about, we can find the block group, and thus we will be able to look up the corresponding per-bg root for whatever operation we're doing.
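A minimal sketch of that lookup, assuming a hypothetical per_bg_root pointer cached on the in-memory block group (btrfs_lookup_block_group() is the existing bytenr-to-block-group helper):

```c
/*
 * Illustrative only: map a logical bytenr to its block group and return the
 * hypothetical per-block-group root cached there.  per_bg_root would be
 * populated when the block group is read in.
 */
static struct btrfs_root *get_per_bg_root(struct btrfs_fs_info *fs_info,
					  u64 bytenr)
{
	struct btrfs_block_group *bg;
	struct btrfs_root *root;

	bg = btrfs_lookup_block_group(fs_info, bytenr);
	if (!bg)
		return ERR_PTR(-ENOENT);
	root = bg->per_bg_root;		/* hypothetical field */
	btrfs_put_block_group(bg);
	return root;
}
```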
Track only shared blocks and data in the extent tree - CHANGING DRASTICALLY, SEE COMMENT ABOUT DROP TREES
This is the big scary thing. You can find a description of how references currently work here and here. I'm not proposing changing the items as they stand right now, simply the rules around how we change them.
We no longer add extent references for cowonly trees. This is simple and straightforward, we just don't update the extent tree with those references. These blocks only get a ref added on allocation and removed on deletion; there are no special rules for them.
We no longer add extent references for non-shared fs tree blocks. This is where things get a little tricky. A non-snapshotted fs tree will have 0 entries for its metadata in the extent tree. In practice a non-snapshotted fs tree acts the same as a cowonly tree, we add a ref on allocation, delete it on deletion. The trick comes into play when we snapshot. The same rules will apply as they always have, with a slight change if we are COW'ing down from the owner root.
We will have a normal reference on everything at level 1 for A', but not for A. We'll cover two full examples, first COW'ing from A and then A', then the reverse.
COW from A
COW from A'
And now for the other way, starting with a cow down from A'
COW from A'
COW from A
The implied reference from the original owner is somewhat tricky, so the logic in update_for_cow() would need to be updated to account for these rules, which are simply
Data extent references
Data extent references need to continue to work as before, as we have more complicated operations we can do with data, such as clone. The only change here is we no longer do bookend extents. Currently the way we handle writing to the middle of an existing file extent is this (this is the normal non-compressed/encrypted case)
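Roughly, with made-up offsets (a 128M extent A and a 4k overwrite at file offset 60M), the current layout ends up like this:

```
disk extent A:  [A, A+128M)                      (still one extent item on disk)
file layout:    [0, 60M)        -> A @ 0
                [60M, 60M+4k)   -> B @ 0         (the new 4k write, its own extent)
                [60M+4k, 128M)  -> A @ 60M+4k
```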
The space that is no longer referenced, sitting in the range that the new file extent replaced, is now wasted, as it will not be freed until the extents to the left and right of it are eventually freed.
The new scheme will be the following.
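With the same made-up offsets as above, the idea is roughly:

```
disk extents:   A   [A, A+60M)           (only the part still referenced)
                B   [B, B+4k)            (the new 4k write)
                A2  [A+60M+4k, A+128M)   (remainder, now its own extent)
file layout:    [0, 60M) -> A,  [60M, 60M+4k) -> B,  [60M+4k, 128M) -> A2
freed:          the 4k of the original extent that the new write replaced
```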
The space in the area that the new file extent replaces will be freed and can be re-allocated.
The compression case will remain the same, as we have to have the entire compressed extent to extract the area the file extent points to. This isn't as bad, because we limit our compressed extent sizes to 128k, so we can only waste 128k-4k worth of space per extent at any given time because of bookend extents, compared to 128M-4k worth of wasted space per extent in the normal case.
Stripe tree - PROBABLY NOT GOING TO DO THIS
EDIT: What I want to accomplish and what @morbidrsa wants to accomplish are slightly different things, and trying to combine them will remove flexibility for both of us. I'm still going to tackle relocation, but this idea is being set aside.
This is fairly straightforward: it'll track the physical location of the logical address space. Traditionally this was just some math: we had a chunk that mapped the physical offset of a block group, so we would take the offset of the logical address within the block group and apply that offset to the physical start of the block group.
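For reference, a sketch of that traditional math, simplified to a single-stripe chunk (real chunks can have multiple stripes for the RAID profiles):

```c
/*
 * Traditional chunk-based mapping, simplified: the offset of the logical
 * address within the chunk is applied to the physical start of the stripe
 * on the device.
 */
static u64 chunk_logical_to_physical(u64 logical, u64 chunk_logical_start,
				     u64 stripe_physical_start)
{
	u64 offset = logical - chunk_logical_start;

	return stripe_physical_start + offset;
}
```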
We would replace this with a stripe tree that actually tracked physical locations for the logical offset, so it could be any arbitrary device and physical offset within that device. This would have the following items, again blatantly ripped off from @morbidrsa with some modifications for my uses as well
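Purely as a hypothetical sketch of the shape such an item could take (the field names here are illustrative, not the actual proposed items):

```c
/*
 * Hypothetical only: one entry per physical extent backing a logical range,
 * keyed by logical address.  The point is that (devid, physical) becomes an
 * explicit on-disk mapping instead of being implied by chunk math.
 */
struct btrfs_stripe_extent {
	__le64 devid;	 /* device this stripe lives on */
	__le64 physical; /* physical start on that device */
	__le64 length;	 /* length of the stripe */
} __attribute__ ((__packed__));
```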
Relocation - CHANGING FROM MY ORIGINAL IDEA
EDIT: Since I'm not doing the stripe tree I want to handle this in a different way. My plan is to do something sort of like what is described below, but instead make a new REMAP tree. If we relocate a block group we will set a REMAP flag on its block group flags (maybe, I have to see if I can actually set the flags to something other than data type), and then populate the REMAP tree with where I've relocated the extents inside the block group. On mount this gets loaded up and we will translate any IO to the new logical offset where the extent resides. Once all of the extents have been freed from the block group, the remap items will be deleted and the block group itself will be deleted. Of course the chunk will have been reclaimed by the time all of the blocks are remapped in the block group, so the space will be available, just the accounting will be removed later.
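A minimal sketch of how the IO translation could look, assuming the REMAP items get loaded at mount into some in-memory list; the entry layout and the lookup structure are assumptions, not the actual design:

```c
/*
 * Hypothetical in-memory form of a REMAP entry: "this old logical range now
 * lives at this new logical address".
 */
struct remap_entry {
	u64 old_logical;
	u64 new_logical;
	u64 len;
};

/*
 * Translate a logical address through the remap entries; addresses outside
 * any relocated range pass through unchanged.  A real implementation would
 * use an rbtree or similar rather than a linear scan.
 */
static u64 remap_logical(const struct remap_entry *entries, int nr_entries,
			 u64 logical)
{
	int i;

	for (i = 0; i < nr_entries; i++) {
		const struct remap_entry *e = &entries[i];

		if (logical >= e->old_logical &&
		    logical < e->old_logical + e->len)
			return e->new_logical + (logical - e->old_logical);
	}
	return logical;
}
```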
The new relocation behavior would be to migrate block groups as normal, but instead of walking the extent tree we would walk the stripe tree finding all stripes that exist in our logical space. We would then go find gaps in other logical areas that could host our stripe, read in the bytes from a btrfs_stripe_extent, and then write them to a new stripe and update the on disk stripe for the extent. We would keep track of the in memory mapping in an extent_map, so the operation would look something like this
Direct and shared bytes tracking per root - PROBABLY NOT GOING TO DO THIS
EDIT: We still have to do lookups after the fact to figure out if a shared extent went from N to 1 references. And we can likely accomplish this behavior with the current qgroup code; it would just require some changes to runtime tracking, and we don't need to do it on disk.
There is one tricky problem with qgroups, and that is the conversion of a data extent from shared to exclusive. This is tricky because it requires a full backref lookup every time we modify an extent so that we can determine if we need to convert the file extent bytes from shared to exclusive. There is not much that can be done about this problem unfortunately, however we can make the intermediate tracking much simpler by storing the current shared and exclusive counts in the root items themselves. This would work like the following
1) Every metadata allocation is exclusive, automatically added to the root when it happens.
2) At snapshot time we do the following
3) At COW time if we're shared we subtract our ->shared, the ->exclusive gets changed when we allocate a new block for the cow.
4) If the reference count went to 1 we know the root that points at us, or we have a fullbackref. If we have the root we go ahead and convert right there. If we have a fullbackref we mark the block to be looked up.
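As a rough sketch of the runtime bookkeeping only, covering the COW-time rule in point 3 (the snapshot-time steps in point 2 aren't spelled out here):

```c
/* Illustrative per-root counters as described above. */
struct root_usage {
	u64 exclusive_bytes;
	u64 shared_bytes;
};

/*
 * COW-time accounting from point 3: if the block being COW'ed was shared, it
 * no longer counts against this root's shared bytes, and the freshly
 * allocated copy is exclusive to this root.
 */
static void account_cow_block(struct root_usage *usage, u64 blocksize,
			      bool was_shared)
{
	if (was_shared)
		usage->shared_bytes -= blocksize;
	usage->exclusive_bytes += blocksize;
}
```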
This will reduce the complexity of tracking these counters across the board, and reduce the amount of backref lookups we have to do.