filecoin-project / lassie

A minimal universal retrieval client library for IPFS and Filecoin

Resumable downloads #274

Open rvagg opened 1 year ago

rvagg commented 1 year ago

We currently have no option to resume a download, which makes lassie pretty fussy and problematic for large fetches: if one fails, you have to start from scratch. Kubo, at least, keeps the data in a blockstore, so it can resume from there.

Challenges to be solved:

As an experiment I've been trying to download a copy of wikipedia (bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze) and can't get more than ~500 MB in with lassie before I hit timeouts or other errors, with no way of resuming. Kubo gets much further, although it slows to a crawl for me at a certain point, but at least I know I can cancel it and start again and it'll still have what it already fetched in its blockstore.

There's a general "large data" problem set that I don't think lassie is up to solving yet.

SgtPooki commented 1 year ago

I just started fetching the .zim file used for that wikipedia root with Lassie (./lassie fetch bafybeibkzwf3ffl44yfej6ak44i7aly7rb4udhz5taskreec7qemmw5jiu). It passed 500 MB for me with only some minor griping (multiple intermittent error messages in the console: 2023-06-12T18:47:39.837-0700 ERROR dt_graphsync graphsync/graphsync.go:203 normal shutdown of state machine).

It's going almost half as fast as the only web2 mirror I found hosting that file: https://mirror.netcologne.de/kiwix/zim/wikipedia/wikipedia_en_all_maxi_2021-02.zim

-rw-r--r--   1 sgtpooki  staff   3.6G Jun 12 19:01 bafybeibkzwf3ffl44yfej6ak44i7aly7rb4udhz5taskreec7qemmw5jiu.car

vs

(screenshot: progress of the mirror download)

The mirror download was started at 6:43pm PST, lassie at 6:46pm PST.

rvagg commented 1 year ago

What's happening with the graphsync errors is that lassie attempts multiple protocols but eventually gives up on the ones that aren't yielding results. Because this content is stored with multiple Filecoin providers, lassie tries each of them at the same time as fetching over bitswap; as they all fail for various reasons, only bitswap is left. But I keep getting context cancelled after some period of time on large downloads over bitswap too, regardless of the --global-timeout and --provider-timeout values; I haven't worked out what's going on there yet.
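For anyone following along, the behaviour described above is roughly a race across candidate retrievals. Here's a minimal sketch in Go of that pattern (not lassie's actual code; `Retrieval` and `raceRetrievals` are made-up names): every candidate runs concurrently, failures drop out, and the first success cancels the rest — which is what produces those "normal shutdown of state machine" messages from the losing attempts.

```go
package retrieval

import (
	"context"
	"errors"
	"fmt"
)

// Retrieval is a hypothetical per-protocol fetch function
// (one graphsync provider, bitswap, HTTP, ...).
type Retrieval func(ctx context.Context) error

// raceRetrievals runs all candidates concurrently. The first one to
// succeed wins and the shared context is cancelled, shutting down the
// others; if every candidate fails, all errors are reported together.
func raceRetrievals(ctx context.Context, candidates []Retrieval) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	errs := make(chan error, len(candidates))
	for _, r := range candidates {
		go func(r Retrieval) { errs <- r(ctx) }(r)
	}

	var failures []error
	for range candidates {
		if err := <-errs; err == nil {
			return nil // first winner; deferred cancel() stops the rest
		} else {
			failures = append(failures, err)
		}
	}
	return fmt.Errorf("all retrievals failed: %w", errors.Join(failures...))
}
```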

rvagg commented 1 year ago

@hannahhoward had the idea of a --blockstore flag for lassie fetch. I imagine that in this mode it wouldn't bother producing the nicely-ordered CAR: it would take an existing CAR, if one exists under the name it's using ({cid}.car, or whatever -o specifies), and use that as the LinkSystem to start from, so blocks already present in it get skipped over in the Graphsync and Bitswap traversals. For HTTP it'll have to re-fetch them, but it shouldn't bother writing them into the CAR output again. We have all the mechanics for this internally, so it really shouldn't be hard to do. I think this is probably the easiest path to some level of resilience: I want to recover from fatal fetches without starting from scratch, especially when I have a multi-GB file sitting in front of me (I'm experiencing this today).
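A minimal sketch of what that flag might do, assuming go-car v2's read-write blockstore (which indexes whatever blocks an earlier, unfinalized write left in the file); `resumeFetch` and `fetchBlock` are hypothetical names, and a real implementation would wire the blockstore into lassie's LinkSystem rather than walking the DAG by hand:

```go
package resume

import (
	"context"
	"fmt"

	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
	carblockstore "github.com/ipld/go-car/v2/blockstore"
)

// resumeFetch opens (or creates) {cid}.car as a read-write blockstore and
// walks the DAG from root, skipping any block a previous, interrupted run
// already wrote.
func resumeFetch(ctx context.Context, root cid.Cid) error {
	path := root.String() + ".car"

	// OpenReadWrite indexes blocks already present in the file, so Has()
	// reflects what an earlier run managed to download.
	bs, err := carblockstore.OpenReadWrite(path, []cid.Cid{root})
	if err != nil {
		return err
	}
	defer bs.Finalize() // writes the CARv2 index on clean shutdown

	var walk func(c cid.Cid) error
	walk = func(c cid.Cid) error {
		have, err := bs.Has(ctx, c)
		if err != nil {
			return err
		}
		if !have {
			blk, err := fetchBlock(ctx, c)
			if err != nil {
				return err
			}
			if err := bs.Put(ctx, blk); err != nil {
				return err
			}
		}
		// ... decode the block and recurse into its links ...
		return nil
	}
	return walk(root)
}

// fetchBlock is a stand-in for the real retrieval (bitswap, graphsync or
// HTTP); it only exists to make the sketch self-contained.
func fetchBlock(ctx context.Context, c cid.Cid) (blocks.Block, error) {
	return nil, fmt.Errorf("not implemented: fetch %s", c)
}
```

The sketch glosses over the traversal itself; the point is just the Has()-then-skip step, which is what makes a restart cheap.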