danobi / btrfs-walk


Unable to understand the logic behind logical to physical address mapping #1

Open SaranyaGit12 opened 8 months ago

SaranyaGit12 commented 8 months ago

Unable to understand the logic behind logical to physical address translation/mapping. Is the logical to physical address mapping the same for all the RAID configurations used by btrfs? Is there any particular documentation or hint that you referenced?

danobi commented 8 months ago

Hi @SaranyaGit12, it's been a while since I studied btrfs, so I've basically forgotten it all. At the time I did write this series of articles, though:

https://dxuuu.xyz/btrfs-internals.html
https://dxuuu.xyz/btrfs-internals-2.html
https://dxuuu.xyz/btrfs-internals-3.html
https://dxuuu.xyz/btrfs-internals-4.html

So maybe that'll help you. Other than that, I'm of no use here

Zer0- commented 1 month ago

For anyone else confused, the only way I've figured it out is by reading the code of btrfs-progs. I'll try to highlight the important bits.

ctree.h defines all the important data structures, and the nice thing about C is that the in-memory layout of these structs is the same as the data on disk.
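To make that concrete, here is a tiny self-contained sketch of the idea (the struct mirrors btrfs_disk_key from ctree.h, but treat the exact field layout as my reading of the header rather than gospel): because the struct is packed the same way as the on-disk bytes, a single memcpy is all the "parsing" you need.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch only: mirrors btrfs_disk_key (objectid, type, offset) from ctree.h.
 * The real header uses __le64/u8 and relies on the same packed layout. */
struct disk_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
} __attribute__((packed));

int main(void)
{
    /* Pretend these 17 bytes were read straight off the disk (little-endian). */
    unsigned char raw[17] = {0};
    raw[0] = 0x05;   /* objectid, purely illustrative */
    raw[8] = 0x84;   /* type, purely illustrative */

    struct disk_key key;
    memcpy(&key, raw, sizeof(key));   /* no parsing step needed on little-endian hosts */

    printf("objectid=%llu type=%u offset=%llu\n",
           (unsigned long long)key.objectid, key.type,
           (unsigned long long)key.offset);
    return 0;
}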

The main entry point I was looking at is the function __open_ctree_fd in disk-io.c, which calls btrfs_setup_chunk_tree_and_device_map, which calls btrfs_read_chunk_tree, which calls read_one_chunk; that is where the first half of the problem is solved. Reading the chunk tree is a very important first step. From my (limited) understanding, the chunk_root field in the superblock is actually a physical address, and it doesn't matter which device in your array you are looking at: they will all have a chunk tree.

First step: build a mapping tree

So the first step is to pick a device, read the superblock and read the chunk tree located at chunk_root. No need for address translation yet.
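Here is a rough, hedged sketch of just that step: the primary superblock sits at a fixed 64 KiB offset on every device, and chunk_root is one of the fields near the start of it. The magic string and the field offsets below are my reading of btrfs_super_block in ctree.h, so double-check them against the header; error handling is kept minimal.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SUPERBLOCK_OFFSET 0x10000ULL      /* primary superblock at 64 KiB */
#define SUPERBLOCK_MAGIC  "_BHRfS_M"      /* 8-byte magic at offset 64 in the struct */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <device>\n", argv[0]);
        return 1;
    }

    unsigned char sb[4096];
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0 || pread(fd, sb, sizeof(sb), SUPERBLOCK_OFFSET) != (ssize_t)sizeof(sb)) {
        perror("read superblock");
        return 1;
    }

    if (memcmp(sb + 64, SUPERBLOCK_MAGIC, 8) != 0) {
        fprintf(stderr, "not a btrfs superblock\n");
        return 1;
    }

    /* Offsets assume the btrfs_super_block layout in ctree.h:
     * csum[32], fsid[16], bytenr, flags, magic, generation, root, chunk_root, ... */
    uint64_t chunk_root;
    memcpy(&chunk_root, sb + 88, sizeof(chunk_root));   /* little-endian on disk */
    printf("chunk_root = %llu\n", (unsigned long long)chunk_root);

    close(fd);
    return 0;
}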

The chunk tree has device items and chunk items, and the chunk items contain the information needed to build the mapping. If you look at the important function read_one_chunk, it takes the information in btrfs_chunk (which is an item in a chunk tree leaf node), puts it into another struct, and then saves a cache_extent structure into a red-black tree (which can be thought of as an ordered set).

struct cache_extent {
    struct rb_node rb_node;
    u64 objectid;
    u64 start;
    u64 size;
};

// snippet from read_one_chunk:
    map = kmalloc(btrfs_map_lookup_size(num_stripes), GFP_NOFS);
    if (!map)
        return -ENOMEM;

    map->ce.start = logical;
    map->ce.size = length;
    map->num_stripes = num_stripes;
    map->io_width = btrfs_chunk_io_width(leaf, chunk);
    map->io_align = btrfs_chunk_io_align(leaf, chunk);
    map->sector_size = btrfs_chunk_sector_size(leaf, chunk);
    map->stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
    map->type = btrfs_chunk_type(leaf, chunk);
    map->sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);

    for (i = 0; i < num_stripes; i++) {
        map->stripes[i].physical =
            btrfs_stripe_offset_nr(leaf, chunk, i);
        devid = btrfs_stripe_devid_nr(leaf, chunk, i);
        read_extent_buffer(leaf, uuid, (unsigned long)
                   btrfs_stripe_dev_uuid_nr(chunk, i),
                   BTRFS_UUID_SIZE);
        map->stripes[i].dev = btrfs_find_device(fs_info, devid, uuid,
                            NULL);
        if (!map->stripes[i].dev) {
            map->stripes[i].dev = fill_missing_device(devid, uuid);
            printf("warning, device %llu is missing\n",
                   (unsigned long long)devid);
            list_add(&map->stripes[i].dev->dev_list,
                 &fs_info->fs_devices->devices);
            fs_info->fs_devices->missing_devices++;
        }

    }
    ret = insert_cache_extent(&map_tree->cache_tree, &map->ce);

Notice that in read_one_chunk the variable map is allocated but apparently nothing is ever done with it; only the embedded cache_extent ce is put into map_tree (the red-black tree that is queried later during lookup). This was mega confusing for me, but because of how C lays out structs, you can get back to the parent container from a pointer to one of its members using the container_of macro. We'll see this in use in a bit.
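If container_of looks like magic, here is a minimal sketch of the trick with made-up names (parent/child are mine, not from btrfs-progs): given a pointer to an embedded member, subtracting the member's offset inside the enclosing struct gets you back to the enclosing struct, which is exactly how the map_lookup is recovered from its embedded cache_extent.

#include <stddef.h>
#include <stdio.h>

/* Minimal version of the kernel-style container_of macro. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct child {
    int value;
};

struct parent {
    int id;
    struct child kid;   /* embedded member, like ce inside map_lookup */
};

int main(void)
{
    struct parent p = { .id = 42, .kid = { .value = 7 } };

    /* Imagine the tree only handed us a pointer to the embedded member... */
    struct child *c = &p.kid;

    /* ...container_of walks back to the enclosing struct. */
    struct parent *recovered = container_of(c, struct parent, kid);
    printf("recovered id = %d\n", recovered->id);   /* prints 42 */
    return 0;
}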

Second step: actually map the logical address

This happens in the __btrfs_map_block function. We'll need to understand it fully, because that's where all the address calculations are carried out. Essentially it returns a list of stripes, a stripe being a (device, physical address) pair.

__btrfs_map_block starts off by querying the tree we built in step one with a dummy cache_extent whose start is your logical address and whose size is one:

    ce = search_cache_extent(&map_tree->cache_tree, logical);

struct cache_extent *search_cache_extent(struct cache_tree *tree, u64 start)
{
    struct rb_node *next;
    struct rb_node *node;
    struct cache_extent *entry;
    struct cache_extent_search_range range;

    range.start = start;
    range.size = 1;
    node = rb_search(&tree->root, &range, cache_tree_comp_range, &next);
    if (!node)
        node = next;
    if (!node)
        return NULL;

    entry = rb_entry(node, struct cache_extent, rb_node);
    return entry;
}

You can implement this functionality with any binary search tree. All you need is a comparison operator for cache_extents (defined in extent_cache.c), and if the tree doesn't have your node (because it's a fake node that was created just to search the tree), it will return the next greatest node.
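To illustrate the "overlapping ranges count as equal" idea, here is a hedged stand-in for that comparison, using a plain linear scan in place of the red-black tree (the struct, names and numbers are made up for illustration; the real comparator and rb_search do the same thing over the rb-tree, just in O(log n)):

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for cache_extent and the search range. */
struct extent { uint64_t start, size; };

/* Range comparison in the spirit of the cache_tree comparator: "equal" means
 * the ranges overlap, otherwise the sign says which direction to keep searching. */
static int range_cmp(const struct extent *range, const struct extent *entry)
{
    if (entry->start + entry->size <= range->start)
        return 1;                     /* entry lies entirely before the range */
    if (range->start + range->size <= entry->start)
        return -1;                    /* entry lies entirely after the range  */
    return 0;                         /* overlap: this is "our" chunk         */
}

int main(void)
{
    /* Chunks sorted by logical start, as they would sit in the rb-tree. */
    struct extent chunks[] = {
        { .start =  1048576, .size = 4194304 },
        { .start = 13631488, .size = 8388608 },
        { .start = 22020096, .size = 8388608 },
    };
    struct extent probe = { .start = 13631488 + 12345, .size = 1 };

    for (unsigned i = 0; i < 3; i++) {
        if (range_cmp(&probe, &chunks[i]) == 0) {
            printf("logical %llu falls in chunk starting at %llu\n",
                   (unsigned long long)probe.start,
                   (unsigned long long)chunks[i].start);
            break;
        }
    }
    return 0;
}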

Then, in the most unexpected turn of events ever, __btrfs_map_block reads backwards in memory to get to the map_lookup representing the chunk that you created in step one:

    map = container_of(ce, struct map_lookup, ce);

I assumed for way too long that this does something else.

And then it simply does a bunch of math using map to get to a stripe, taking into consideration which RAID setup you have.
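To make "a bunch of math" a bit more concrete, here is a hedged sketch of the calculation for the two simplest layouts, SINGLE and RAID0. It follows my reading of __btrfs_map_block in volumes.c, but the struct and the numbers are made up for illustration, and it ignores mirrors, parity and error handling.

#include <stdint.h>
#include <stdio.h>

/* Illustrative subset of map_lookup; the real struct carries a
 * btrfs_device pointer per stripe rather than just a devid. */
struct stripe   { uint64_t devid, physical; };
struct chunkmap {
    uint64_t logical_start;   /* ce.start          */
    uint64_t stripe_len;      /* commonly 64 KiB   */
    int      num_stripes;
    int      raid0;           /* 1 = RAID0, 0 = SINGLE/DUP-style */
    struct stripe stripes[4];
};

/* Translate one logical address into (devid, physical). */
static struct stripe map_block(const struct chunkmap *m, uint64_t logical)
{
    uint64_t offset        = logical - m->logical_start;
    uint64_t stripe_nr     = offset / m->stripe_len;   /* which stripe_len unit */
    uint64_t stripe_offset = offset % m->stripe_len;   /* position inside it    */
    int      stripe_index  = 0;

    if (m->raid0) {
        /* RAID0 rotates units across devices, then advances on each device. */
        stripe_index = stripe_nr % m->num_stripes;
        stripe_nr    = stripe_nr / m->num_stripes;
    }

    struct stripe out = m->stripes[stripe_index];
    out.physical += stripe_nr * m->stripe_len + stripe_offset;
    return out;
}

int main(void)
{
    struct chunkmap m = {
        .logical_start = 13631488, .stripe_len = 65536,
        .num_stripes = 2, .raid0 = 1,
        .stripes = { { .devid = 1, .physical = 1048576 },
                     { .devid = 2, .physical = 2097152 } },
    };
    struct stripe s = map_block(&m, 13631488 + 65536 + 100);
    printf("devid=%llu physical=%llu\n",
           (unsigned long long)s.devid, (unsigned long long)s.physical);
    return 0;
}

As far as I can tell, for RAID1/DUP the index just selects which mirror to read, and RAID5/6 adds parity rotation on top, but the stripe_nr / stripe_offset split above is the common core.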