fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License
1.04k stars 362 forks source link

Consolidate block fetch requests. #1733

Closed groutr closed 3 weeks ago

groutr commented 1 month ago

Implement the TODO to consolidate consecutive block fetch requests into a single fetch request.

Adapted from the itertools example found here: https://docs.python.org/2.6/library/itertools.html#examples

martindurant commented 1 month ago

Would you mind explaining more completely what this does? It also makes that block of code look more complex (to me), so some comments inline would be appreciated.

groutr commented 1 month ago

Python's groupby works by splitting based on a change in the value of a key function. The chosen key computes the difference between an ascending range and the block numbers (which are also necessarily sorted). If the that difference changes, we know that there was a "hole" in the block numbers (ie, we already have the block cached).

Assume a file of 5 blocks, [0, 1, 2, 3, 4] where we have already cached block 2. The caller requests a byte range that covers blocks 0, 1, 3.

need = (i for i in block_range if i not in self.blocks)   # lazily evaluated to [0, 1, 3]
groupby(enumerate(need), key=lambda x: x[0]-x[1])
    Enumerating need pairs the needed blocks with an ascending range
    (0, 0), (1, 1), (2, 3)  # Note that the actual block numbers are the second values.
    Applying the key we get values: 0, 0, 1
    Groupby splits a new group on each key value change so we end up with [(0, 0), (1, 1)] and [(2, 3)] groups.
Since we enumerated need, we need to extract the second values from each group to get our blocks, 
so we map itemgetter(1) over the group.
For the first group, we have blocks [0, 1], so we compute the offset for beginning of block 0 and the end of block 1.
One call is made to fetch for both blocks 0 and 1.

The second group is just a single block, 3. One call is made to fetch for that block.

I'll add an explanation via comments.

martindurant commented 1 month ago

You might be interested, that something similar happens in https://github.com/fsspec/filesystem_spec/blob/master/fsspec/utils.py#L528 (where we don't have blocks, but we do have byte ranges and possible gaps).

groutr commented 1 month ago

@martindurant Thank you for pointing my attention to that function. I'm not sure if the approach here is directly applicable there. I'll have to spend more time reasoning thru it.

groutr commented 4 weeks ago

@martindurant Would you prefer that I instead use merge_offset_ranges in this PR?

martindurant commented 4 weeks ago

Would you prefer that I instead use merge_offset_ranges

Only if it's obvious how. I am not in a good position to make that choice, so long as you have considered the possibility.

groutr commented 4 weeks ago

@martindurant I played around with an implementation and observed that merge_offset_ranges does not sufficiently preserve enough info to adequately update the cache metadata. Once the byte ranges are merged, we lose the info about which blocks those ranges represent and it becomes much harder to update the cache metadata.