btrfs / btrfs-todo

An issues only repo to organize our TODO items

Logical mapping tree feature (relocation rework) #54

Open josefbacik opened 6 months ago

Problem

Our current relocation mechanism is quite complicated and heavy-handed. It creates a variety of issues for us, and since we rely on it more and more, it needs to be replaced with something more modern.

I want to replace this with a new incompat feature to make it a lot simpler, and to enable changes for extent tree v2.

Design

First is a new tree, which will contain logical-to-logical mappings, something akin to this:

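struct btrfs_key remap_key = {
    .objectid = offset,             /* start of the old logical range */
    .type = BTRFS_REMAP_KEY_ITEM,
    .offset = len,                  /* length of the remapped range */
};

struct btrfs_remap_item {
    u64 new_offset;                 /* logical offset the range now lives at */
};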

The tree would be populated with these objects, and would remap a given logical offset to a new offset. Block groups that have been relocated would be marked with a flag indicating that they're remapped, and any access to their offsets would result in a lookup in the remap tree to find the new offset.
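To make the lookup concrete, here is a minimal userspace sketch of the translation step. It is not the proposed kernel code: the sorted array stands in for the remap tree, and `struct remap_entry` and `remap_logical()` are hypothetical names that just flatten the key and item above into one record.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened form of one remap key + item pair. */
struct remap_entry {
    uint64_t old_start;  /* key.objectid: start of the old range */
    uint64_t len;        /* key.offset: length of the range */
    uint64_t new_start;  /* btrfs_remap_item.new_offset */
};

/*
 * Translate a logical address through a table that is sorted by old_start
 * and has no overlapping ranges. Returns 0 and stores the new address in
 * *out when an entry covers the address, -1 otherwise.
 */
static int remap_logical(const struct remap_entry *tbl, size_t n,
                         uint64_t logical, uint64_t *out)
{
    size_t lo = 0, hi = n;

    /* Binary search for the first entry with old_start > logical. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].old_start <= logical)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return -1;  /* address is below every remapped range */

    /* The candidate is the last entry starting at or before the address. */
    const struct remap_entry *e = &tbl[lo - 1];
    uint64_t delta = logical - e->old_start;

    if (delta >= e->len)
        return -1;  /* address falls in a gap after the candidate */

    *out = e->new_start + delta;
    return 0;
}
```

In the real feature this search would be a btree lookup keyed on the old offset, but the range check and the `new_start + delta` arithmetic would be the same idea.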

Relocation would now do the following.

  1. Mark the block group read only.
  2. Walk the free space tree for the block group's range, treating any holes in the free space tree as ranges that need to be relocated. We use the free space tree here because in the future not all metadata will be tracked in the extent tree, and this lets us relocate larger chunks of extents instead of just individual areas.
  3. Loop through the chunks of data, allocating new regions and copying the data into them, then insert a btrfs_remap_item into the remap tree for each relocated range.
  4. Once this is complete, mark the block group as remapped, delete the block group and free the underlying device extents.
  5. As the ranges in the original block group are freed, block_group->used is decremented on both the original block group and the block group where the extents now actually live. Once the original block group's block_group->used hits 0, we can go and remove all of the remap items for the range of that block group.
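Step 2 above is the least obvious part, so here is a rough sketch of it under some assumptions: the free space tree is represented as a sorted array of non-overlapping free ranges that all lie inside the block group, and `struct range` and `find_used_ranges()` are made-up names for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

struct range {
    uint64_t start;
    uint64_t len;
};

/*
 * Given the free ranges of a block group (sorted, non-overlapping, all
 * inside [bg_start, bg_start + bg_len)), report the holes between them.
 * Those holes are the ranges that hold data and need to be relocated.
 * Returns the number of holes found; only the first max_out are stored.
 */
static size_t find_used_ranges(uint64_t bg_start, uint64_t bg_len,
                               const struct range *free_ranges, size_t nfree,
                               struct range *out, size_t max_out)
{
    uint64_t cursor = bg_start;
    uint64_t bg_end = bg_start + bg_len;
    size_t found = 0;

    for (size_t i = 0; i < nfree; i++) {
        /* Anything between the cursor and the next free range is in use. */
        if (free_ranges[i].start > cursor) {
            if (found < max_out) {
                out[found].start = cursor;
                out[found].len = free_ranges[i].start - cursor;
            }
            found++;
        }
        cursor = free_ranges[i].start + free_ranges[i].len;
    }

    /* Tail of the block group after the last free range. */
    if (cursor < bg_end) {
        if (found < max_out) {
            out[found].start = cursor;
            out[found].len = bg_end - cursor;
        }
        found++;
    }
    return found;
}
```

Each range returned here would then be fed to step 3: allocate a new region, copy, and insert one remap item for it.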

There are a few tricks here.

  1. BTRFS_BLOCK_GROUP_SYSTEM wouldn't be able to be remapped. We need to be able to bootstrap the system, so we would have to keep the old relocation for this. That is fine, because for cowonly trees the old relocation works well: we walk the tree and COW the blocks in it. We would need to update relocation to not use the reloc root in this case, but that's about it.
  2. BTRFS_BLOCK_GROUP_REMAP. Again, we need to be able to read things, so we can't have the remap tree itself remapped. We can't put the remap tree in BTRFS_BLOCK_GROUP_SYSTEM because SYSTEM chunks are stored in the super block, so we're limited to the size of our total SYSTEM area. We would need to create a new block group type that would contain the remap tree, and again, if we ever wanted to relocate those block groups, we'd have to use the same old-style relocation as above.
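The two exceptions boil down to a flag check when picking how to relocate a block group. A tiny sketch, with assumptions stated up front: the `BG_*` names are local stand-ins, the DATA/SYSTEM/METADATA bits mirror the existing on-disk flag values, and the `BG_REMAP` bit is an arbitrary placeholder since the new type does not exist yet.

```c
#include <stdint.h>

#define BG_DATA     (1ULL << 0)
#define BG_SYSTEM   (1ULL << 1)
#define BG_METADATA (1ULL << 2)
#define BG_REMAP    (1ULL << 20)  /* placeholder bit, not a real value */

enum reloc_method {
    RELOC_REMAP_TREE,  /* copy ranges and insert remap items */
    RELOC_OLD_STYLE,   /* walk the cowonly tree and COW its blocks */
};

/*
 * SYSTEM block groups are needed to bootstrap the file system, and REMAP
 * block groups hold the remap tree itself, so neither can depend on a
 * remap tree lookup. Everything else can be remapped.
 */
static enum reloc_method pick_reloc_method(uint64_t bg_flags)
{
    if (bg_flags & (BG_SYSTEM | BG_REMAP))
        return RELOC_OLD_STYLE;
    return RELOC_REMAP_TREE;
}
```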

This gives us lots of benefits.

  1. The code is far simpler.
  2. We no longer have this delayed refs explosion.
  3. We can relocate fragmented file systems. If we have a 1GiB data extent we can carve it up into 4KiB chunks if we want and put them all over the place.
  4. Now that we don't have to worry about our data extent size, we could drastically increase our maximum extent size to whatever we want. Currently we limit it to 128MiB because of relocation; we could drop that limit, which would make things like NOCOW prealloc files for VM images much faster and more efficient.
  5. It allows me to remove cowonly trees from the extent tree, drastically reducing our extent tree size.
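As an illustration of the "carve it up" point in benefit 3, here is a sketch of splitting one extent into fixed-size pieces with a separate remap entry per piece. Everything here is hypothetical: `struct remap_entry` is a flattened key + item, and `struct bump_alloc` is a toy allocator that deliberately leaves gaps so the destinations are not contiguous.

```c
#include <stddef.h>
#include <stdint.h>

struct remap_entry {
    uint64_t old_start;
    uint64_t len;
    uint64_t new_start;
};

/* Toy stand-in for the extent allocator: hands out space with gaps. */
struct bump_alloc {
    uint64_t next;
    uint64_t gap;
};

static uint64_t bump_alloc_get(struct bump_alloc *a, uint64_t len)
{
    uint64_t ret = a->next;

    a->next += len + a->gap;
    return ret;
}

/*
 * Split [start, start + len) into pieces of at most chunk bytes and give
 * each piece its own destination. Returns the number of entries written,
 * which is capped at max_out.
 */
static size_t carve_extent(uint64_t start, uint64_t len, uint64_t chunk,
                           struct bump_alloc *a,
                           struct remap_entry *out, size_t max_out)
{
    size_t n = 0;

    if (chunk == 0)
        return 0;

    while (len > 0 && n < max_out) {
        uint64_t this_len = len < chunk ? len : chunk;

        out[n].old_start = start;
        out[n].len = this_len;
        out[n].new_start = bump_alloc_get(a, this_len);
        n++;
        start += this_len;
        len -= this_len;
    }
    return n;
}
```

Because every piece has its own entry, the old extent never needs one contiguous destination, which is what makes relocation on a badly fragmented file system possible.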