yarikoptic opened 4 years ago
Given that you are asking for JS, this is probably always going to be highly dependent on the JS framework that you are using. I would imagine that there are already React or Vue components that do what you are suggesting.
Is something like this applicable? https://www.npmjs.com/package/electron-download-manager
The main limitation is that the widely supported web APIs do not allow downloading multiple files at once or starting downloads without user interaction. Workarounds exist for a small number of files, but they do not scale to entire datasets (you can start a few single-file downloads, but more than that is treated as spam by browsers). Most web apps solve this by doing the bundling server side. For OpenNeuro, we use a service worker to zip files as they are downloaded and pass the resulting stream back to the main browser thread. This does not allow you to resume the download, since we need to avoid buffering the zip contents to keep the page from being killed. It also has some limitations due to 32-bit bitwise operations in JavaScript, so it really doesn't work well for large files.
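For a rough idea of that approach (this is not the actual OpenNeuro code; the zip framing, error handling and authentication are omitted, and the virtual URL and query parameter are made up for illustration), a stripped-down service worker sketch might look like:

```js
// sw.js -- a minimal sketch of streaming several files back as one download,
// assuming the page navigates to a hypothetical virtual URL like
// /download-all?file=...&file=... (real zip framing is omitted).
self.addEventListener('fetch', (event) => {
  const url = new URL(event.request.url);
  if (url.pathname !== '/download-all') return;

  const fileUrls = url.searchParams.getAll('file'); // hypothetical query parameter
  const stream = new ReadableStream({
    async start(controller) {
      for (const fileUrl of fileUrls) {
        const response = await fetch(fileUrl);
        const reader = response.body.getReader();
        // Forward chunks as they arrive instead of buffering whole files;
        // this is what keeps memory use flat for large datasets, at the cost
        // of not being able to resume a broken download.
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          controller.enqueue(value); // a real implementation would wrap this in zip records
        }
      }
      controller.close();
    },
  });

  event.respondWith(new Response(stream, {
    headers: {
      'Content-Type': 'application/octet-stream',
      'Content-Disposition': 'attachment; filename="dataset.zip"',
    },
  }));
});
```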
The solution we're working on uses the Native File System API for direct file access from the web app and Background Fetch to offload the actual file transfer to the browser. This sidesteps the limitations and could support everything you've proposed, but it is only implemented in Chromium for now, so any implementation needs a fallback. Firefox has some groundwork for these features in progress, but I'm not sure what the timeline is for supporting them fully.
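As a hedged sketch of how those two APIs could fit together (using the current shape of the APIs, which were still changing names at the time; Chromium-only, partly behind flags; the ids, titles and URLs are illustrative only):

```js
// 1) Native File System / File System Access: write a downloaded file
//    straight to a user-chosen directory, streaming rather than buffering.
async function saveToDisk(fileUrl, fileName) {
  const dirHandle = await window.showDirectoryPicker();                  // user picks a target folder
  const fileHandle = await dirHandle.getFileHandle(fileName, { create: true });
  const writable = await fileHandle.createWritable();
  const response = await fetch(fileUrl);
  await response.body.pipeTo(writable);                                  // stream to disk
}

// 2) Background Fetch: hand the transfer off to the browser so it can
//    continue even if the tab is closed; the id and title are made up.
async function startBackgroundDownload(urls) {
  const registration = await navigator.serviceWorker.ready;
  return registration.backgroundFetch.fetch('dataset-download', urls, {
    title: 'Dataset download',
  });
}
```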
The native file system half of this is enabled on OpenNeuro for downloads if you are using a recent Chromium-based browser and enable the flag.
@shots47s That package uses some Electron runtime specific APIs that are not available in browsers. OpenNeuro uses React but most of the download code is vanilla JavaScript so it could be extracted as a non-React-specific library.
part of the thought process, especially for DANDI, is going to depend on what to download - since we will be looking at various dataset sizes, we may need some notion of more efficient download pathways. i don't think it's going to be just about a js downloader, but more about what is being downloaded (a single 20G - 200G file, a 3TB dataset, or a piece of a file via an API). all of those options are on the table at the moment.
Thank you @shots47s @nellh and @satra! @nellh -- it is great to hear that you are working on this new feature! The Native File System API indeed sounds like the underlying solution here. I should keep an eye on it, Background Fetch, and OpenNeuro. If in the course of your development you turn it into an independent library, that would probably be the solution I was looking for ;) I am OK to aim for the future, and thus await the needed support in all mainline browsers while meanwhile recommending Chrome. Sparse files could potentially be (ab)used for the "piece of a file" cases if local caching is a desired feature (touching upon @satra's use case), but overall that is a bit beyond the original simple scope/desires I had in mind ;)
Not sure I fully understood your use-case, so will add some broad comments.
If you're asking for JS specifically to target browsers: as @nellh explained, the browser sandbox does not make bulk downloads or local filesystem access very easy. It sounds like studying their efforts is the way to go :]
Should other runtimes be acceptable, such as Node.js, Electron, or a hybrid (a portion in the browser plus some type of local daemon it defers the downloading to), these limitations don't apply. Neither would you be tied to JS for such a setup, so I wonder if that's on the table or not.
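To illustrate the contrast, here is a minimal user-land sketch in Node.js with no sandbox in the way (no resume or checksum logic, no redirect handling; the URL list and file naming are placeholders):

```js
// download.js -- fetch a list of URLs to local files from user space.
const fs = require('fs');
const https = require('https');
const path = require('path');

function download(url, destination) {
  return new Promise((resolve, reject) => {
    const file = fs.createWriteStream(destination);
    https.get(url, (response) => {
      if (response.statusCode !== 200) {
        reject(new Error(`HTTP ${response.statusCode} for ${url}`));
        return;
      }
      response.pipe(file);                          // stream straight to disk
      file.on('finish', () => file.close(resolve));
    }).on('error', reject);
  });
}

async function main() {
  const urls = ['https://example.com/sub-01_T1w.nii.gz']; // placeholder list
  for (const url of urls) {
    await download(url, path.basename(new URL(url).pathname));
  }
}

main().catch(console.error);
```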
To avoid the bottleneck of a single "central archive server"
Should your goal for a standardization effort be more about decentralization, I would instead see if there's any way to avoid HTTP as the transport entirely. For example, tried-and-tested torrents or the up-and-coming IPFS have much better properties when it comes to reproducible downloads from varied sources.
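For a taste of what the torrent route could look like in the browser, a hedged sketch using the WebTorrent library (it assumes the dataset is seeded by WebRTC-capable peers, and the magnet link is a placeholder; getBlobURL still buffers the file in memory, so this only illustrates the transport side, not storage):

```js
import WebTorrent from 'webtorrent';

const client = new WebTorrent();
const magnetURI = 'magnet:?xt=urn:btih:...'; // placeholder magnet link

client.add(magnetURI, (torrent) => {
  // Pieces are hash-checked as they arrive, and interrupted transfers resume
  // from where they left off -- the properties plain HTTP downloads lack here.
  torrent.files.forEach((file) => {
    file.getBlobURL((err, url) => {
      if (err) throw err;
      const a = document.createElement('a');
      a.href = url;
      a.download = file.name;
      a.click();
    });
  });
});
```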
up-and-coming IPFS have much better properties
it's always nice when it comes from someone else!
Thanks @Beanow for flying in here and giving feedback! My 0.02 is that the general goals for the tool should be discussed - as @satra and @Beanow have mentioned, it's a very different story if we are talking about really huge data versus smaller data (and then whether something running natively in the browser or server side is appropriate). I can definitely imagine the kind of API that I'd want (regardless of the tool language or protocol) - something that smells a bit like the Snakemake remote API, where you instantiate a provider and then issue largely the same calls. If I'm guessing this is for datalad, it would be nice to have it also serve a command line client. I would suspect that, now that Electron is going out of style and we have the File API in the browser, we might see something along the lines of an rsync in the browser (using Node or WebAssembly?) - heck, I just did a search and found such tools are being worked on!
Anyway, whatever the crew here decides to do / contribute to / test, I'd like to offer to help; testing a new protocol or browser technology would indeed be fun :) But as mentioned already, the use case / goal should first be hardened a bit.
@Beanow: for the CONP use case, we would like to be able to download DataLad datasets from the browser, starting with those that are not too large (say up to a few tens of GB) - maybe pushing the use case of very large datasets a bit down the road.
@jbpoline thank you for the context :smiley: To have the full picture, I would like to ask why specifically the browser is chosen.
Not as a direct response, but as some general thoughts about the browser use case. Devil's advocate hat on:
Long ago, in the early 90's, the browser was designed to "surf the web", where the web primarily consisted of text with links (hence Hypertext Transfer Protocol) and cat pictures. Since then, webpages have become richer and more versatile, and browsers more capable, but the primary use case is still to surf the web.
For reference, this web page currently weighs in at ~1.6MB for me. The primary strategy to recover from a failed or corrupted download is to press F5 (retry from scratch). For 1.6MB that's a perfectly reasonable solution. With my network this takes ~3s to download, interpret and render.

If it were 200MB, you may still get away with F5 in some use cases.

A single download that's only 10s of GBs is certainly 1 or 2 orders of magnitude past where F5 is an acceptable recovery strategy, even for an outstanding quality, high-bandwidth connection. The time and bandwidth wasted by throwing away the previous download attempt is too great.
Basically put, we would be operating the browser far, far outside of spec to push this into TB territory.
The fact that this browser, which started out as a hammer, might soon be usable as a pile driver for apartment foundations is actually kind of awesome in its own right! But I would be hard-pressed to say this is a "natural fit" for the browser for the foreseeable future.
It's why I can't think of any industry that currently uses the browser for this task. Your Google Drive / Dropbox has a separate piece of software for when you're looking to sync 10s of GBs. The games industry defers downloads to the likes of a Steam / Uplay launcher, when some games nowadays weigh 100s of GBs each. And so on.
It's easy to make the case for it: when the cost of an error is losing hours or days of time and bandwidth, why not invest minutes to install a task-specific tool that guarantees you have negligible loss during errors? :smile:
Perhaps I'm not clear on the rationale, but I think answering "how badly does it need to be in the browser, and what's driving that choice?" will help outline the goals and specifications for this library / tool.
Is installing other software enough of a barrier to risk this time loss? Is that because of a user-friendliness issue with the alternatives? Are we working around organizational IT policies that make installing other software a bureaucratic nightmare?
@Beanow I am so glad that you are here for your careful thinking!
To have the full picture, I would like to ask why specifically the browser is chosen.
From my PoV: a "web browser" is omnipresent on all computers, known to users, and doesn't require additional installation - everything (JS libraries, images, etc.) necessary to accomplish the desired mission is automatically loaded for the user. So, on the user end, having a "Download" button in the web page simplifies interaction and somewhat assures compatibility across operating and even file (see below) systems.
Your google drive / dropbox has a separate piece of software ...
yes, but their web interfaces also have a "Download" button (for a file or a folder), probably for the same reason outlined above. And fulfilling this use case is the primary goal behind this hypothetical project ;)
NB FWIW the Dropbox app stopped working for me on Linux on a btrfs file system -- it said unsupported, then support was claimed to have been added, but starting it resulted in the same message... Also, downloading from a list of URLs in user space is really not that tricky a task. It could be pretty much a tiny script around wget, or even git-annex/datalad sitting on top. In other words -- let's assume that we have such a solution already.
@yarikoptic I use Dropbox! It hasn't broken... yet... :grimacing:
The browser has been HTML/CSS/JS for years now, and the "something else" that is starting to take off is WebAssembly. I don't think we will have well-developed APIs any time soon, but (at least in the future) I don't think the browser will remain as ill-equipped for downloading large files as it is now.
yes, but their web interface(s) also has a "Download" button (for a file or a folder), probably for the same reason outlined above.
With limitations: 20GB for Dropbox (https://help.dropbox.com/installs-integrations/sync-uploads/download-entire-folders) and 2GB for Google Drive (https://support.google.com/drive/thread/13150334?hl=en - haven't seen an official entry).
Also, I believe this doesn't do integrity checking, resuming of partial downloads, bulk downloads (other than remote zipping, as previously mentioned), etc. - all of which would be needed in order not to have F5 (start from scratch) as the recovery mode.
Yes, I think they offer such a download button for easy access without installing new software. But I think they will not offer several TB downloads where you won't lose hours/days on error. For the reasons I outlined :]
Also, downloading from a list of urls in the user space is really not that tricky of a task.
Indeed! :smile: As opposed to the browser, there's a mountain of options that work well and were specifically designed for the task: rsync, IPFS, git-annex/datalad, Syncthing, torrents, ...
The reason being, the browsers' security model uses a sandbox specifically to restrict the use of resources such as your hard drive and network, which are exactly what we need for this task. In "user land", not having this sandbox makes such an application almost trivial by comparison. More importantly, in "user land" we're not fighting the sandbox just to make the task feasible, so we can go beyond feasible and make it do the right thing ©. Which is why I think these tools will, for now, offer a superior experience.
Either way, I'm not trying to discourage the browser use case. I just want to make sure I outlined why it's a significantly more difficult feat to achieve, one that also gives up major benefits of the alternatives, while those alternatives are easy and abundant.
If you believe it's worth the effort regardless, then go for it :smile: as a tech guy it's fun to see these challenges being tackled.
With limitations: 20GB for Dropbox (https://help.dropbox.com/installs-integrations/sync-uploads/download-entire-folders) and 2GB for Google Drive (https://support.google.com/drive/thread/13150334?hl=en - haven't seen an official entry).
oh -- "good" to know, thank you!
Also, I believe this doesn't do integrity checking, resuming of partial downloads, bulk downloads (other than remote zipping, as previously mentioned), etc. - all of which would be needed in order not to have F5 (start from scratch) as the recovery mode.
well -- they under-delivered ;-) And it is not surprising, since the majority of their use cases indeed do not require TBs of data downloads. But that doesn't mean it could not be made possible (hence this discussion)!
Also, downloading from a list of urls in the user space is really not that tricky of a task.
Indeed! As opposed to the browser, there's a mountain of options that work well and were specifically designed for the task: rsync, IPFS, git-annex/datalad, Syncthing, torrents, ... ... I just want to make sure I outlined why it's a significantly more difficult feat to achieve, one that also gives up major benefits of the alternatives, while those alternatives are easy and abundant.
well -- now I will put the devil's advocate hat on: all those (besides git-annex/datalad ;-) ) require an initial "buy in" to provide data through their protocol; they wouldn't work with a simple list of URLs. Having said that, a tool (like git-annex) can work, or could be made to work, with URLs using arbitrary protocols (ipfs://, rsync://, torrent://). But git-annex + datalad do require installation, which might be cumbersome for some users, and there is no convenient GUI ATM. What would be the alternative: an open-source tool, extensible with support for new protocols, that we could recommend users install (very easily, cross-platform), so we could feed it such a list of URLs and it would do the right thing?
Note that my initial use case aims for "http{,s}" URLs in order to seamlessly integrate with the protocol supported natively by the browser(s), and because the majority of existing use cases (regular web sites, S3, etc.) could provide such a list. If implementations of rsync://, ipfs://, etc. later appear on the JS client side -- great! Such a library could provide support.
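For reference, the naive "list of http(s) URLs" baseline that browsers support today looks roughly like the sketch below (URLs and names are illustrative); as noted earlier in the thread, browsers treat more than a handful of such programmatic downloads as spam, which is exactly the limitation under discussion:

```js
// Naive baseline: trigger one download per URL with a temporary <a download> element.
// Works for a handful of files; browsers block longer bursts as download spam,
// and there is no resuming or integrity checking.
function downloadAll(urls) {
  for (const url of urls) {
    const a = document.createElement('a');
    a.href = url;
    a.download = '';        // hint to download rather than navigate (honored same-origin only)
    document.body.appendChild(a);
    a.click();
    a.remove();
  }
}

downloadAll([
  'https://example.org/data/sub-01_T1w.nii.gz',  // placeholder URLs
  'https://example.org/data/sub-02_T1w.nii.gz',
]);
```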
More importantly though, in "user land" we're not fighting this sandbox to make it feasible. So we can go beyond feasible and make it do the right thing ©.
Now with my datalad hat on, I could just say "we already have a tool which can do the right thing in user space". But that is the solution I am looking to avoid -- demanding a user-space installation. Even I really dislike, and at times hate, being asked to install yet another tool just to download something, regardless of how lightweight and easy to install it is. I do not think that we would ever arrive at "the-tool-to-rule-all-downloads" that we all agree to install and use. That is why I am interested in a browser library which would be omnipresent: any project could adopt it and any user could use it without requiring yet another installation. Precedents are there, many shortcomings have been identified, and new technologies are being developed so that workarounds are no longer required; overall, it should be feasible to get the browser to do the right thing for a "simple" Download task.
Hi, we've been discussing this in CONP too; here is the current state of our specification: https://github.com/CONP-PCNO/conp-portal/wiki/Data-download-mechanism-from-CONP-portal @xlecours has been working on an implementation.
@glatard I wondered if there were any updates on your progress?
not really for now but stay tuned, it's coming :)
The following projects/people might either be interested or already have an "in-house" implementation which could serve as a basis:
OpenNeuro: @nellh
CONP: @jbpoline
Vanessasaurus of all trades: @vsoch
bisweb: @bioimagesuiteweb
DataLad: @mih @kyleam
DANDI: @satra @mgrauer
BALSA: https://balsa.wustl.edu (couldn't locate a GitHub contact, will ping)
To summarize the question: is there already a JS library we all could (ab)use to facilitate downloading from a list of URLs? If not, what needs to be done to make it happen, which features outlined in https://github.com/con/jsdownloader/blob/master/README.md might be impossible to achieve, and what considerations should be kept in mind?