Closed groutr closed 3 weeks ago
Would you mind explaining more completely what this does? It also makes that block of code look more complex (to me), so some comments inline would be appreciated.
Python's groupby works by splitting based on a change in the value of a key function. The chosen key computes the difference between an ascending range and the block numbers (which are also necessarily sorted). If the that difference changes, we know that there was a "hole" in the block numbers (ie, we already have the block cached).
Assume a file of 5 blocks, [0, 1, 2, 3, 4]
where we have already cached block 2. The caller requests a byte range that covers blocks 0, 1, 3.
need = (i for i in block_range if i not in self.blocks) # lazily evaluated to [0, 1, 3]
groupby(enumerate(need), key=lambda x: x[0]-x[1])
Enumerating need pairs the needed blocks with an ascending range
(0, 0), (1, 1), (2, 3) # Note that the actual block numbers are the second values.
Applying the key we get values: 0, 0, 1
Groupby splits a new group on each key value change so we end up with [(0, 0), (1, 1)] and [(2, 3)] groups.
Since we enumerated need, we need to extract the second values from each group to get our blocks,
so we map itemgetter(1) over the group.
For the first group, we have blocks [0, 1], so we compute the offset for beginning of block 0 and the end of block 1.
One call is made to fetch for both blocks 0 and 1.
The second group is just a single block, 3. One call is made to fetch for that block.
I'll add an explanation via comments.
You might be interested, that something similar happens in https://github.com/fsspec/filesystem_spec/blob/master/fsspec/utils.py#L528 (where we don't have blocks, but we do have byte ranges and possible gaps).
@martindurant Thank you for pointing my attention to that function. I'm not sure if the approach here is directly applicable there. I'll have to spend more time reasoning thru it.
@martindurant Would you prefer that I instead use merge_offset_ranges
in this PR?
Would you prefer that I instead use merge_offset_ranges
Only if it's obvious how. I am not in a good position to make that choice, so long as you have considered the possibility.
@martindurant I played around with an implementation and observed that merge_offset_ranges
does not sufficiently preserve enough info to adequately update the cache metadata. Once the byte ranges are merged, we lose the info about which blocks those ranges represent and it becomes much harder to update the cache metadata.
Implement the TODO to consolidate consecutive block fetch requests into a single fetch request.
Adapted from the itertools example found here: https://docs.python.org/2.6/library/itertools.html#examples