There are currently several open issues about large files and a few ideas, so I thought I would create a single issue to collect references to these issues, summarize possible solutions, and provide links to some resources.
Open issues
Issue #36 started as a problem downloading a file and evolved into concerns over hosting of large files and download times
Issue #29 concerned large files in the repository, and why we should avoid adding or changing any large files in the git repo itself
Issues #2 and #8 are open questions about where to host files.
(Any objections to closing these issues?)
Solutions
A few proposed solutions:
The Open Science Framework (OSF) provides hosting for large files for scientific projects. It has a web interface, but direct URLs for individual files can also be obtained, so files can be downloaded via wget or curl (see the sketch after this list). We currently have several files hosted on OSF.
Synapse.org also provides hosting of data sets for open science and can provide DOI numbers for resources.
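To make the wget/curl option concrete, here is a minimal Python sketch of scripting such a download; the URL is a hypothetical placeholder in the osf.io/&lt;id&gt;/download style, not one of our actual files, and streaming in chunks simply keeps memory use low for large files.

```python
import urllib.request

# Hypothetical placeholder: a direct OSF download URL (osf.io/<file-id>/download).
url = "https://osf.io/abcde/download"

# Stream the response to disk in chunks so a large file never has to fit in memory.
with urllib.request.urlopen(url) as response, open("dataset.csv", "wb") as out:
    while True:
        chunk = response.read(1024 * 1024)  # 1 MiB per read
        if not chunk:
            break
        out.write(chunk)
```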
Topics for Discussion
Can we pin numbers on our requirements to get a sense of how much this might cost? (If the budget is zero, that's useful to know too!) A back-of-envelope sketch follows the questions below.
What is the size/number of data sets we might realistically end up hosting?
What is the expiration date for this data?
How many users do we expect to download the public data sets?
What constraints do the users have (e.g., is a slower connection acceptable in return for lower cost, or does the data need to be available quickly and reliably)?
Is there a preference (in terms of ease of disbursing funds or space) between a physical server with storage and cloud storage?
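To make the cost question concrete, here is a back-of-envelope sketch; every number in it (dataset size, download count, per-GB rates) is an illustrative placeholder to be replaced once we answer the questions above, not a quote from any provider.

```python
# All values are illustrative placeholders, not real requirements or prices.
dataset_size_gb = 50           # total size of hosted public data sets
downloads_per_month = 20       # expected full downloads per month
storage_price_per_gb = 0.023   # $/GB-month, placeholder object-storage rate
egress_price_per_gb = 0.09     # $/GB transferred out, placeholder egress rate

storage_cost = dataset_size_gb * storage_price_per_gb
egress_cost = dataset_size_gb * downloads_per_month * egress_price_per_gb

print(f"Storage: ${storage_cost:.2f}/month")
print(f"Egress:  ${egress_cost:.2f}/month")
print(f"Total:   ${storage_cost + egress_cost:.2f}/month")
```

With these placeholder numbers, egress dominates storage, which is exactly the kind of trade-off the questions above should pin down.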
There are some cloud workflows for avoiding large downloads as well, depending on the constraints and where we want to dedicate time. These would definitely be useful in the context of testing.
Can we develop workflows for how to use cloud storage drives to store/share databases? (Seems useful to users of both open and private databases.)
Example: develop a workflow that builds a dataset from databases (public or private) into a cloud drive image/snapshot; then develop further workflows that expect that data to be local, which is accomplished by mounting the image/snapshot via network file storage or some other mechanism
How do we leverage the cloud provider's network? For example, mounting an S3 bucket as network storage still has limited transfer rates, but the transfer happens entirely inside Amazon's network (likely faster, and typically free of egress charges). A sketch of this idea follows below.
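As one concrete illustration of keeping transfers inside the provider's network and avoiding whole-file downloads, here is a sketch using boto3 to read only a byte range of an object; the bucket and key names are hypothetical placeholders, and this is one possible mechanism rather than a settled workflow.

```python
import boto3

# Hypothetical placeholders, not real resources of this project.
BUCKET = "example-project-data"
KEY = "snapshots/database-snapshot.sqlite"

s3 = boto3.client("s3")

# Fetch only the first mebibyte of the object instead of the whole file.
# Run from a compute instance in the same region as the bucket, the transfer
# stays inside the provider's network (fast, and typically free of egress charges).
response = s3.get_object(Bucket=BUCKET, Key=KEY, Range="bytes=0-1048575")
header_bytes = response["Body"].read()

print(f"Read {len(header_bytes)} bytes without downloading the full object")
```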