Fetch improvements - Githubissues

dmchurch commented 8 years ago

A few improvements to the fetch behavior, to minimize number of fetches (causing seek latency for local files, or excess HTTP header traffic) and to minimize memory usage for the ISO object by not caching backing sectors for file content.

barneygale commented 8 years ago

Thanks for the PR, makes a lot of sense. I'll try to review the code within a day or two!

barneygale commented 8 years ago

Apologies for the massive delay. This looks good to me, and I have a couple of minor suggestions. Would you be willing to make a few changes? Otherwise I can merge and make the changes myself.

dmchurch commented 8 years ago

Sure, I can make some changes! What did you have in mind?

I've also been doing some other work on my local repo, as I've been integrating it into some professional work I've been doing, and I'd be happy to get that sorted for upstream too once this is in.

barneygale commented 8 years ago

More contributions very much appreciated

Could you make FileSource and HTTPSource use *args in addition to **kwargs in __init__?
There's a couple bits of the fetch logic which might be wrong, depending on what we're going for
- If the end of the requested range of sectors is uncached, we pass the condition that results in fetch_needed(start_sector + fetch_sectors - need_start), but when fetch_sectors > n_sectors, we haven't checked if the sectors we're downloading to satisfy self.min_fetch are already cached, which means we could download data that's already cached
- If the end of the requested range is cached, we don't satisfy self.min_fetch. I'm not sure if this is wrong or not.

dmchurch commented 8 years ago

The reason I didn't include *args in there is because my general feeling around positional vs. keyword arguments in Python is that positional should only be used for things which are intrinsic to the behavior of the function/class (like the pathname/URL), and anything shaped like an "optional tweak" should be keyword-only, otherwise it makes the code less readable at the calling location. If this were a Python 3 library, I'd have put a * after the self argument to Source.__init__ to enforce it there, as well. If you'd prefer for people to be able to call FileSource and HTTPSource with cache_content and min_fetch as positional arguments, though, I can add *args.
I was trying to balance fetch efficiency with code simplicity and targeting the general case only, which may account for some of the behavioral oddities here:
- Yes, this can result in a small amount of duplicate fetch, but it doesn't seem like a particularly large impact, especially since each overlap would be no more than 32KB (using the default min_fetch value). It seemed to me like continuing the sector search beyond the end of the seek range to ensure that we never refetch sectors would add needless complexity to the code, especially since we'd have to extend the xrange of the sector for loop. It's a pretty rare edge case, anyway -- it only occurs when the initial part of this seek was satisfied by a prefetch and there is another fetched area very shortly past the end of this seek, and that only happens when doing out-of-order fetches. If you think it's worth it, though, I can add a check to ensure we don't overfetch on this.
- This is fine; if we run into any already-fetched area, then it must be part of a block at least min_fetch sectors large. At that point, we know we're gapfilling, and extending out to min_fetch is guaranteed not to fetch any new sectors.

barneygale commented 8 years ago

OK, that all makes sense to me.

barneygale commented 8 years ago

Thanks again for this!

barneygale / isoparser

Fetch improvements #1