cavaliergopher / grab

A download manager package for Go

Check the Last-Modified header of the remote file #19

Open cavaliercoder opened 6 years ago

cavaliercoder commented 6 years ago

This PR aims to achieve the following outcomes:

jimsmart commented 6 years ago

Hi,

I see an issue filed to do with mistakenly treating a file as incomplete when it was actually complete, and I see here you are trying to ascertain whether a file download is complete or not — and I see a 'help wanted' tag...

I understand that you currently describe grab as stateless, but of course it then gets tricky to work out what's arguably the most important state: whether a file is already a completed download.

Perhaps one simple piece of state might help here: while a file is still in the process of downloading, its filename could have '.part' appended to it; then you know for sure whether it's complete or not.
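A minimal sketch of how a caller could adopt that convention today with grab's public API (assuming the current github.com/cavaliergopher/grab/v3 module path); the '.part' suffix and the rename step are the suggestion above, not behaviour grab provides itself:

```go
package main

import (
	"log"
	"os"

	"github.com/cavaliergopher/grab/v3"
)

func main() {
	const finalName = "example.iso"
	partName := finalName + ".part" // in-progress downloads carry a .part suffix

	req, err := grab.NewRequest(partName, "http://example.com/example.iso")
	if err != nil {
		log.Fatal(err)
	}

	resp := grab.NewClient().Do(req)
	if err := resp.Err(); err != nil {
		// The .part file is left on disk and can be resumed on the next run.
		log.Fatal(err)
	}

	// Only a fully completed download is promoted to its final name, so the
	// presence of a .part suffix always means "incomplete".
	if err := os.Rename(partName, finalName); err != nil {
		log.Fatal(err)
	}
}
```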

I think this would solve all of the issues you mention in https://github.com/cavaliercoder/grab/issues/18#issuecomment-328850476

I suspect that relying purely on local timestamps for all parts of this logic is just a bug-magnet, as you already detail.

I'm not sure how you can solve your third bullet, 'resumed downloads are never corrupted by remote file changes', without keeping some kind of state/metadata. Just knowing the current part-downloaded file's size and timestamps doesn't seem enough to know whether it's the same file or not. As you state in #18, perhaps that should be the responsibility of the parent/caller. But perhaps grab should provide a hook for this.

cavaliercoder commented 6 years ago

I agree, storing state might be the only sane way to solve these issues. I could also store ETags in the state file to see if the remote file has changed. Thanks for your feedback.

The problem is that this gives grab a much heavier footprint and starts writing files to people's disks without their say-so. I intended for it to be an easy way to monitor the progress of multiple parallel downloads and quickly resume if things go sideways. As you say, a hook for the caller to manage this problem might be the answer.

I did recently add Request.BeforeCopy. Could this suit your needs?
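For reference, a hedged sketch of how BeforeCopy could be used for the Last-Modified check this issue asks about. It assumes the v3 module path, that the response headers are available to the callback, and that returning an error cancels the transfer; lastKnownModified stands in for whatever value the caller has persisted themselves:

```go
package main

import (
	"errors"
	"log"

	"github.com/cavaliergopher/grab/v3"
)

func main() {
	req, err := grab.NewRequest("example.iso", "http://example.com/example.iso")
	if err != nil {
		log.Fatal(err)
	}

	// lastKnownModified stands in for whatever the caller recorded the last
	// time this file downloaded successfully (grab itself stores nothing).
	lastKnownModified := "Wed, 21 Oct 2015 07:28:00 GMT"

	req.BeforeCopy = func(resp *grab.Response) error {
		remote := resp.HTTPResponse.Header.Get("Last-Modified")
		if remote != "" && remote == lastKnownModified {
			// Returning an error cancels the transfer before any bytes are copied.
			return errors.New("remote file unchanged; keeping local copy")
		}
		return nil
	}

	resp := grab.NewClient().Do(req)
	if err := resp.Err(); err != nil {
		log.Println("download cancelled or failed:", err)
	}
}
```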

cavaliercoder commented 6 years ago

I could also add StoreState and RestoreState hooks that are only called if the caller explicitly wants to take advantage of this feature.
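To make the proposal concrete, here is one purely hypothetical shape such opt-in hooks could take; none of these names, types, or semantics exist in grab, and they are sketched for discussion only:

```go
// Hypothetical sketch only: StoreState and RestoreState are not part of grab.
package sketch

// State is whatever metadata the caller chooses to persist between runs.
type State struct {
	ETag         string
	LastModified string
	BytesWritten int64
}

// StateHooks would be ignored unless the caller explicitly sets them,
// keeping grab stateless by default.
type StateHooks struct {
	// RestoreState is called before a transfer starts so the caller can
	// supply previously saved metadata for resume validation.
	RestoreState func(url string) (*State, error)

	// StoreState is called after a transfer completes so the caller can
	// persist the new metadata wherever they like (file, database, etc.).
	StoreState func(url string, s *State) error
}
```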

jimsmart commented 6 years ago

I think it certainly requires more thought and discussion, but I currently think hooks are probably the way to go. It's possible that the existing BeforeCopy might work for me; I've not fully investigated. But would that alone help us solve the outstanding issues to do with resuming partial downloads?

I have an intended use case, which involves caching ~200k files, plus various metadata (and refreshing/updating the cache later). But I think it should probably not be the responsibility of grab to manage all of that metadata, for several reasons: idiomatic design, separation of concerns, fewer dependencies, and it probably not being everyone's use case. Plus, implementing state/metadata persistence in a single file won't scale up in that scenario; that data belongs in a database, probably owned by the parent app. Obviously I want to build on something robust and fault-tolerant, so I'm happy to do whatever it takes to ensure there are no corrupted files due to bad resumes. Hence the feedback! 👍

I still think using a specific file extension/suffix for partially downloaded files could be a good thing. In its favour: it solves several issues easily and quite sanely, and it is somewhat familiar behaviour, since many browsers do something similar (although I'd be more than surprised if they didn't also store some metadata/state). On the other hand, it's not a full solution to the outstanding problems; specifically, it does nothing to ensure 'resumed downloads are never corrupted by remote file changes', which, albeit an edge case, would still result in buggy behaviour. Another consideration is whether implementing such renaming/moving would interfere in any way with implementing 'proper' state/metadata management, or whether it would perhaps be helpful.

Food for thought.

cavaliercoder commented 6 years ago

I wonder if your use-case with 200k files is similar to the one that birthed this project - namely, cloning YUM package repositories?

Grab should offer some convenience by way of managing the concurrency and batching of your 200k requests, while providing thread-safe status updates. You can guarantee eventual consistency using checksums, or reduce the risk (without a hard guarantee) by disabling resume.
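A hedged illustration of those two options, assuming the v3 module path and that the expected SHA-256 sum is known to the caller ahead of time:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"log"

	"github.com/cavaliergopher/grab/v3"
)

func main() {
	req, err := grab.NewRequest("example.iso", "http://example.com/example.iso")
	if err != nil {
		log.Fatal(err)
	}

	// Option 1: verify the finished file against a known checksum and delete
	// it on mismatch, so a bad resume can never leave a corrupt file behind.
	// The sum below is a placeholder; substitute the real SHA-256 of the file.
	sum, err := hex.DecodeString(
		"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")
	if err != nil {
		log.Fatal(err)
	}
	req.SetChecksum(sha256.New(), sum, true)

	// Option 2: forgo resuming entirely so every transfer starts from byte 0
	// (safer against remote changes, but without the same hard guarantee).
	// req.NoResume = true

	resp := grab.NewClient().Do(req)
	if err := resp.Err(); err != nil {
		log.Fatal(err)
	}
}
```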

I've closed the related issues as wontfix, but I'll keep this open as an experiment to see how it would look if we implemented this kind of guarantee and state persistence in grab. It certainly seems to be in high demand.

cavaliercoder commented 6 years ago

Relates to #21 and #18.

justinfx commented 4 years ago

FWIW I have the same requirements as @jimsmart and approached the solution using the BeforeCopy hook. I hit one issue, which I just fixed via #80 (thanks for merging). So at this point I hook in before the copy and consult my local datastore for the hash from the previous download, to compare with a hash that is delivered in the current response headers. That allows me to cancel if I conclude that I already have the same complete file. I also use the AfterCopy hook to dynamically read the hash from the header and call SetChecksum with it just before grab moves to that state.

I don't think grab should learn about state, but I do think the more hooks we have into the request and response lifecycle, the better, because the user knows the details of the state and grab does not. If there were earlier hooks, before the file is opened, I could have solved my issue there as well without the fix in #80. Grab is a great library so far, and having more hooks would remove the burden of responsibility from grab and make it even more flexible for users.
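A condensed sketch of that workflow, with hedges: the X-Checksum-Sha256 header name and the lookupStoredHash helper are stand-ins for this user's own setup, and it assumes, as described above, that a checksum set from AfterCopy is applied before grab's final validation step:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"log"

	"github.com/cavaliergopher/grab/v3"
)

// lookupStoredHash stands in for the caller's own datastore lookup; it is
// not part of grab.
func lookupStoredHash(url string) string {
	return "" // e.g. query a local database keyed by URL
}

func main() {
	url := "http://example.com/example.iso"
	req, err := grab.NewRequest("example.iso", url)
	if err != nil {
		log.Fatal(err)
	}

	// Before copying, compare the hash advertised in the response headers
	// (header name is an assumption here) with the hash stored from the
	// previous download, and cancel if the file is already complete.
	req.BeforeCopy = func(resp *grab.Response) error {
		remote := resp.HTTPResponse.Header.Get("X-Checksum-Sha256")
		if remote != "" && remote == lookupStoredHash(url) {
			return errors.New("already have the complete file")
		}
		return nil
	}

	// After copying, feed the header-provided hash into grab's checksum
	// validation so the downloaded bytes are verified before completion.
	req.AfterCopy = func(resp *grab.Response) error {
		remote := resp.HTTPResponse.Header.Get("X-Checksum-Sha256")
		if remote == "" {
			return nil // nothing to verify against
		}
		sum, err := hex.DecodeString(remote)
		if err != nil {
			return err
		}
		resp.Request.SetChecksum(sha256.New(), sum, true)
		return nil
	}

	resp := grab.NewClient().Do(req)
	if err := resp.Err(); err != nil {
		log.Fatal(err)
	}
}
```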