HumanCellAtlas / dcp-cli

DEPRECATED - HCA Data Coordination Platform Command Line Interface
https://hca.readthedocs.io/
MIT License

Download resources are wasted if multiple threads download the same file #481

Open jessebrennan opened 4 years ago

jessebrennan commented 4 years ago

With the current architecture, it's possible that multiple threads are downloading the same file.

This does not affect correctness of the download because of the filestore layout. It does affect efficiency. Because of "copy forward", the same files appear in both the primary and secondary bundles. If both are adjacent in the same manifest, the duplicate download becomes much more likely.

One idea for a solution would be to keep a global table of all of the files that are currently downloading or already downloaded. Threads can check this table and sleep if the file is already in it.
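A minimal sketch of the global-table idea, assuming the table lives in one process and is shared by all download threads. `DownloadTable`, `fetch_once`, and the `download` callable are hypothetical names for illustration, not part of dcp-cli. A waiting thread blocks on an event instead of polling with sleep.

```python
import threading


class DownloadTable:
    """Hypothetical shared table of files that are downloading or downloaded."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = {}  # file key -> Event, set once the owner's attempt ends

    def fetch_once(self, key, download):
        """Run download(key) in at most one thread per key.

        Returns True if this thread performed the download, False if it
        waited for (or found) another thread's download of the same key.
        """
        with self._lock:
            event = self._events.get(key)
            owner = event is None
            if owner:
                event = self._events[key] = threading.Event()
        if not owner:
            event.wait()  # block until the owning thread's attempt ends
            return False
        try:
            download(key)
        except BaseException:
            # Assumption: on failure, forget the key so a later call can retry.
            # Threads already waiting are still released and do not retry.
            with self._lock:
                del self._events[key]
            raise
        finally:
            event.set()
        return True
```

The key would need to identify file content rather than its position in a bundle (for example a checksum), so that copied-forward files in the primary and secondary bundles map to the same entry.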

Another idea would be to have a .tmp version of the file that exists until the download is complete. I have not thought through all of the implications of this design.
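A rough sketch of the `.tmp` idea, assuming the `.tmp` file doubles as an "in progress" marker that is created exclusively and renamed into place once the download completes. `claim_and_download` and `read_chunks` are hypothetical names, not dcp-cli functions.

```python
import os


def claim_and_download(dest_path, read_chunks):
    """Download to dest_path + '.tmp', then rename it to dest_path.

    Returns "done" if the file already exists, "in_progress" if another
    thread holds the .tmp marker, or "downloaded" if this call fetched it.
    """
    if os.path.exists(dest_path):
        return "done"
    tmp_path = dest_path + ".tmp"
    # O_EXCL makes creation fail if the marker already exists, so only one
    # thread (or process) can claim the download.
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY | getattr(os, "O_BINARY", 0)
    try:
        fd = os.open(tmp_path, flags)
    except FileExistsError:
        return "in_progress"
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in read_chunks():
                tmp.write(chunk)
        os.replace(tmp_path, dest_path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # release the claim so another thread can retry
        raise
    return "downloaded"
```

Two implications this sketch leaves open: a process that crashes mid-download leaves a stale `.tmp` file that blocks later attempts, and a thread that checks for `dest_path` just before another thread renames its `.tmp` can still start a redundant (but correct) download.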