dcppc / data-stewards

Questions and answers about TOPmed, GTEx, and AGR resources.

For medium to large data dumps - where to host? #7

Open owhite opened 6 years ago

owhite commented 6 years ago

We have several requests for data dumps. GitHub may have size limits on the data, and there could be other reasons that it's impractical to store the data sets here, but it would still be useful to have the data sets versioned and documented on this system. What approach should we use?

carlkesselman commented 6 years ago

Personally, I would suggest that we get S3 space or a similar object store in the cloud somewhere. If it would be helpful, I could set up a simple registry of what data is there. I would also suggest that we mint identifiers (minids?) for whatever data files we place there.
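
A minimal sketch of what that could look like, assuming boto3; the bucket name and JSON registry are placeholders, and the UUID-based identifier is only a stand-in for a properly minted minid:

```python
import json
import uuid
from pathlib import Path

import boto3

# Hypothetical bucket and registry file -- placeholders, not an agreed convention.
BUCKET = "dcppc-data-dumps"
REGISTRY = Path("registry.json")


def register_dump(local_path: str) -> str:
    """Upload a data dump to S3 and record it in a simple JSON registry."""
    s3 = boto3.client("s3")
    key = Path(local_path).name
    s3.upload_file(local_path, BUCKET, key)

    # Stand-in identifier; a real deployment would mint a minid instead.
    identifier = f"dump:{uuid.uuid4()}"

    entries = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else []
    entries.append({"id": identifier, "s3_uri": f"s3://{BUCKET}/{key}"})
    REGISTRY.write_text(json.dumps(entries, indent=2))
    return identifier
```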

I’m not sure GitHub is a good solution for data storage.

Carl

cmungall commented 6 years ago

I know it's part of the plan handed down to the data stewards to duplicate on the Amazon and Google clouds, but within GO and other projects we have been exploring some other options that predate the DC.

We like osf.io. OSF provides free storage and guarantees they will keep your stuff up for something like 25 years. It's super easy to distribute your data to OSF thanks to @ctb's group's CLI tool: https://github.com/dib-lab/osf-cli (it should be easy to combine with standards in the BDBag family, but we haven't got round to it yet). We're currently exploring the use of OSF as a sustainable distribution solution as part of the Open Biomedical Ontologies Foundry. We're using it for MONDO at the moment and are very happy with it.
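
For reference, a rough sketch of pushing a release with that CLI tool, driven from Python; the project ID, username, and file path are made up, and the exact flags are from memory, so check the osf-cli README:

```python
import os
import subprocess

# osf-cli reads the password from the environment; placeholder value shown here.
os.environ.setdefault("OSF_PASSWORD", "changeme")


def push_to_osf(local_path: str, project_id: str = "abc12") -> None:
    """Upload a release artifact to an OSF project via the osf-cli tool."""
    subprocess.run(
        ["osf", "-p", project_id, "-u", "someone@example.org",
         "upload", local_path, local_path],
        check=True,
    )


# Hypothetical release file.
push_to_osf("releases/mondo-2018-07-01.obo")
```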

In GO, we're exploring a dual solution, with S3 as primary distribution (with a cloudfront layer) and OSF for archiving.

In other projects we're also looking at git annex plus OSF or archive.org. I agree with @carlkesselman that GitHub is not a good storage solution for large files, but git can be a fantastic tool for managing and versioning complex distributions of files. (A lot of people conflate git annex with git-lfs, which was IMHO a bit of a horrible experience, but git annex seems a lot better.)
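
To make the git-annex idea concrete, a rough sketch of that workflow, scripted from Python; the remote name, bucket, and file path are made up, and the initremote options are from memory, so double-check them against the git-annex docs:

```python
import subprocess


def run(*cmd: str) -> None:
    """Run a git / git-annex command in the current repo and fail loudly on error."""
    subprocess.run(cmd, check=True)


# Track a large file with git-annex instead of committing its content to git.
run("git", "annex", "init", "data distribution repo")
run("git", "annex", "add", "dumps/topmed_dump_v1.tar.gz")
run("git", "commit", "-m", "Add TOPMed dump v1 (content managed by git-annex)")

# Hypothetical S3 special remote; encryption disabled only to keep the sketch short.
run("git", "annex", "initremote", "cloud",
    "type=S3", "bucket=dcppc-annex-example", "encryption=none")
run("git", "annex", "copy", "dumps/topmed_dump_v1.tar.gz", "--to=cloud")
```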

ianfoster commented 6 years ago

DCPPC is focused on cloud. Why would we not use S3?

jmcmurry commented 6 years ago

All, please be aware that when you comment on an issue via email, you should first remove all quoted text and your email signature, especially if you do not want to be spammed.

krobasky commented 6 years ago

FYI: I've used a FUSE-mounted S3 bucket from an AWS EC2 instance, and the mount can disappear under heavy compute loads. So take into consideration that S3-served data might need to be mirrored locally prior to computing over it. There may be faster data-delivery tiers for S3 that mitigate this problem; I haven't seen any, but I could do a deeper dive on it if anybody here thought it might be helpful.
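
One way to do that mirroring step up front, sketched with boto3 under an assumed bucket, prefix, and scratch path, rather than reading through a FUSE mount during the compute itself:

```python
from pathlib import Path

import boto3


def mirror_prefix(bucket: str, prefix: str, dest: str) -> None:
    """Copy every object under an S3 prefix to local disk before starting the compute job."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            target = Path(dest) / obj["Key"]
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, obj["Key"], str(target))


# Hypothetical bucket and prefix; mirror first, then point the pipeline at /scratch.
mirror_prefix("dcppc-data-dumps", "copdgene/images/", "/scratch/copdgene")
```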

jmcherry-zz commented 6 years ago

You all decide; the MODs can do whatever.

Alastair-Thomson-NHLBI commented 5 years ago

So, how about we create a writeable bucket on AWS for this? Or maybe two - one for temp storage and one for persistent?
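
A sketch of the two-bucket idea with boto3; the bucket names, region, and 30-day expiry are placeholders for whatever we actually agree on:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Hypothetical bucket names: persistent data and temp/scratch data kept separate.
s3.create_bucket(Bucket="dcppc-persistent-example")
s3.create_bucket(Bucket="dcppc-temp-example")

# Expire objects in the temp bucket automatically so scratch data does not accumulate.
s3.put_bucket_lifecycle_configuration(
    Bucket="dcppc-temp-example",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-temp-objects",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```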

clarisca commented 5 years ago

@AlastairThomson: I think this is a great idea, particularly for additional data sets that we may want to store in the cloud for data integration activities. Which entity would be responsible for managing this storage, deciding which data sets are registered, etc.?

ashokohio commented 5 years ago

@AlastairThomson, @clarisca: as it turns out, we are facing both temp and persistent storage questions on STAGE for the COPDGene image data analysis. Deep learning on a large group of images will likely need large temp storage; at the same time, there may be many Monte Carlo-type runs whose final results may have to go to persistent storage for later analysis. Performance is also an issue, so we are considering EFS on AWS, but that costs much more.
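
A small sketch of that hand-off, assuming boto3 and hypothetical paths and bucket names: the deep-learning runs write intermediate data to scratch (local disk or EFS), and only the final results get pushed to persistent storage.

```python
from pathlib import Path

import boto3


def archive_results(scratch_dir: str, bucket: str, prefix: str) -> None:
    """Push final result files from scratch storage to a persistent S3 bucket."""
    s3 = boto3.client("s3")
    root = Path(scratch_dir)
    for path in root.rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(root)}"
            s3.upload_file(str(path), bucket, key)


# Hypothetical locations; intermediate data stays on scratch, results persist in S3.
archive_results("/efs/copdgene/run_042/results", "dcppc-persistent-example", "copdgene/run_042")
```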