Select DOI-providing data-sharing repositor(ies) to integrate, and which to integrate first

MiguelRodo commented 2 years ago

Apparently, FigShare can be free (for files <20GB), as can Zenodo. Dryad is not.

see this blog.

MiguelRodo commented 2 years ago

Links to #83 #23 #82 and that WikiSATVI issue about raw data

MiguelRodo commented 1 year ago

FigShare
- Pros
  - UCT's data repo
  - Can request more data
  - Can link to an organisation
- Cons
  - Seems the most expensive
  - Limited to 20GB as a student
  - The R package is no longer maintained
  - Do you need to be on the UCT network to upload to it?

I think FigShare is just out for now. It's actually a disadvantage to use UCT's ZivaHub, as you have to be connected to the UCT network to access it, making collaboration harder.

FlowRepository

- Pros
  - Free
- Cons
  - Data can stay private for one year at most
  - No R package
    - FlowRepositoryR was removed from Bioconductor and was last updated 8 years ago, but it seems it was still working (a GitHub issue was answered at the end of 2021). However, FlowRepository itself is changing its backend due to spammed requests, and there is no API available for FlowRepositoryR to plug into.

IMMPort

- Pros
  - Handles many datasets
- Cons
  - I think it's not free, or perhaps it's only for data funded by particular organisations
  - Somewhat onerous to submit to

In general, I think the first one we'd want to try to work with is a generalist repository. We can then see how it works, and go from there.

So here is Zenodo vs OSF:

Zenodo vs OSF

Zenodo
- Pros
  - Free
  - Can download open-access files directly
  - Can create a community (e.g. SATVI)
    - Can then enforce that naming conventions are followed
  - Can add to multiple communities
  - 50GB limit per component means it's easier to create enough components for massive datasets
  - Max 100 files per record (can then just zip them) (https://www.openaire.eu/zenodo-guide)
  - Can version releases, instead of simply versioning individual files
- Cons
  - 50GB limit per dataset
  - Required for upload:
    - Title
    - Authors
    - Description
    - License
  - Can't download many files at once from GUI (but can from within R, as long as they're open access)
  - Cannot download closed-access files automatically
OSF
- Pros
  - Has a (seemingly very good) R package
  - Links to cloud storage providers
  - Easy to upload data
  - Can create a branching structure for all SATVI projects
  - Can download closed-access files automatically
  - Can download multiple entire folders
- Cons
  - Does not link to Google shared drives
  - Web interface is intimidating and not pretty
  - 5GB/component limit for private projects (50GB/component for public projects)
  - osfr package failed to download when run as part of rmarkdown::render (in DataTidyACSAntibody), but did work when run interactively
Non-differentiating factors
- Have an R package
DataVerse
- Has a 1TB/user limit
Discussion
- Zenodo is intended as a data repository whereas OSF, due to the 5GB limit, is more intended to work with data repositories.
- Zenodo does not group multiple resources together very well.
- The Zenodo R package seems a bit harder to use.
- Zenodo would have to be handled like GitHub in terms of searching your uploads.
- Zenodo
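
As a rough illustration of the "create enough records for massive datasets" point above, here is a minimal Python sketch that plans how to split a set of files into uploads that each respect the 50GB and 100-file limits listed for Zenodo. It does not call any Zenodo API; `plan_records` and `Record` are hypothetical helper names, the file names are made up, and treating 1GB as 10^9 bytes is an assumption.

```python
from dataclasses import dataclass, field

GB = 10**9                   # assumption: decimal gigabytes
MAX_RECORD_BYTES = 50 * GB   # per-record size limit listed above
MAX_RECORD_FILES = 100       # per-record file-count limit listed above


@dataclass
class Record:
    """One planned upload (hypothetical helper, not part of any Zenodo client)."""
    files: list = field(default_factory=list)
    total_bytes: int = 0

    def fits(self, size):
        # A file fits if neither the file-count nor the size limit is exceeded.
        return (len(self.files) < MAX_RECORD_FILES
                and self.total_bytes + size <= MAX_RECORD_BYTES)

    def add(self, name, size):
        self.files.append(name)
        self.total_bytes += size


def plan_records(file_sizes):
    """Group {name: size_in_bytes} into records using first-fit decreasing."""
    records = []
    # Place the largest files first, each into the first record with room.
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size > MAX_RECORD_BYTES:
            raise ValueError(f"{name} exceeds the per-record limit on its own")
        target = next((r for r in records if r.fits(size)), None)
        if target is None:
            target = Record()
            records.append(target)
        target.add(name, size)
    return records


if __name__ == "__main__":
    # Made-up file names and sizes, purely for illustration.
    sizes = {"a.fcs": 30 * GB, "b.fcs": 30 * GB, "c.fcs": 15 * GB, "d.fcs": 5 * GB}
    for i, rec in enumerate(plan_records(sizes), start=1):
        print(i, rec.files, rec.total_bytes / GB)
```

First-fit decreasing is only a heuristic, so the plan is not guaranteed to use the fewest possible records, but it is simple and usually close. Zipping many small files together first would reduce the file count per record.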

MiguelRodo commented 1 year ago

Review

This review points out that Zenodo has a clearer interface but doesn't support sub-directories within resources, as compared to OSF:

[Dmytro Kryvokhyzha](https://github.com/)

Bioinformatics & Genomics Scientist

The best free Research Data Repository
I compare the most popular repositories for research data: Dryad, Zenodo, FigShare, Open Science Framework, and Mendeley.

You need to deposit your research data to a repository and you are lost in options. I have been in the same situation recently.

If your data is of a specific type then the choice is obvious. You deposit that data to a data-type specific repository. For example, nucleic acid sequence data need to be uploaded to the [Sequence Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra). Scripts and programs should be deposited to [GitHub](https://github.com/evodify) or a similar resource with a [version control system](https://git-scm.com/book/en/v1/Getting-Started-About-Version-Control). Usually, you need to do your best to use these repositories because this will increase the chance that other researchers find your data. Here is an extensive [list of data-type specific repositories](https://www.nature.com/sdata/policies/repositories).

But if you also have some non-standard data formats, you need to use a generalist repository. The most popular ones are Dryad, FigShare, and Zenodo. These were the repositories I found first. Later, I also discovered the Open Science Framework (OSF) and it became my number one research data repository.

My key criteria when I was looking for the best repository for my scientific data were:

- Free
- DOI
- Ability to update files
- Directory structure

Publishing in open-access journals already costs a fortune, so I wanted to use a free repository to avoid additional spending. A digital object identifier (DOI) is probably a must for any publication. It is especially useful if you publish a dataset without a link to any paper. A DOI makes it easier to cite the dataset. I also would like to have an option to edit or update the data after the initial deposit. Mistakes are always possible and it is better to be able to correct them. The amount of data grows enormously and usually my projects have many files structured in directories. I would like to keep this directory structure in my repositories too. The OSF repository meets these requirements the best.

Let me briefly summarize my opinion on each of the repositories I tried.

Dryad

Dryad is the most popular research data repository. It is recommended by many journals. I used it to publish [the supplementary data for my Molecular Ecology paper](https://doi.org/10.5061/dryad.q83pt). By publishing in Molecular Ecology, you get a link to deposit your data to Dryad for free.

However, it is not a free repository. You need to pay $120 for a submission of up to 20GB, and +$50 for each additional 10GB. On the other hand, such a business model guarantees long term existence of this repository.

I like it for its simple and easy to use interface. Uploading the data is very simple and fast. You get a DOI for your data and some simple metrics such as a number of page views and downloads. But you cannot edit anything after the submission. There is no directory structure support, so you can upload a directory only as an archive file.

Pros:

- popular
- simple
- DOI
- metrics

Cons:

- non-free
- no edit/update after the submission
- no directory structure support
- not optimized for downloading many files at once

FigShare

FigShare is a great repository for visual content. It shows a preview of every file. If I recall correctly, this was the initial purpose of FigShare. Now, you can also use FigShare to upload any file type.

There is no limit on file size if you make the files public. You can modify your files after publication with a version control system.

I think FigShare should be used only to share posters, slides, and figures. It is not convenient for sharing dozens of files. You can use collections and projects to unite many files. But there is no easy way to download many files. The interface of the repository is also not simple. You often need to navigate several windows to access a file.

Pros:

- popular
- free
- DOI
- unlimited space
- image preview

Cons:

- optimized only for single visual file sharing
- complicated to use
- no directory structure support
- not optimized for downloading many files at once

Zenodo

Zenodo is good in many regards. It is free. There is a version control system. The DOI is provided. You can meter page views and downloads.

The file size limit is 50GB per dataset but you can have an unlimited number of datasets.

However, you cannot create folders with files. You can upload each folder as a separate dataset or compress each folder into an archive and upload it. But this is not an ideal solution.

Pros:

- popular
- free
- DOI
- simple interface
- version control system

Cons:

- no directory structure support
- not optimized for downloading many files at once
- 50GB limit per dataset

Open Science Framework

OSF is my favorite repository to store my research data. It is surprisingly not very popular. It took a while until I found it. I believe its popularity will grow as it is an amazing repository for scientific data.

OSF is free. You get a DOI for your repository. There is a version control system. It supports directory structure in repositories. You can update your files after the publication and the history of the repository is tracked.

The default file size limit is 5 GB. But you can extend this limit with [add-ons](https://help.osf.io/hc/en-us/articles/360019737894-FAQs#what-is-the-cap-on-data-per-user-or-per-project).

The OSF interface is more advanced than in other repositories. I consider it an advantage. But it is a little too advanced and some users may find it difficult to use. So, I will still list it in the cons.

Pros:

- free
- DOI
- version control system
- supports directory structure
- optimized for downloading many files at once

Cons:

- not popular
- advanced interface
- 5GB limit per file (no number of files limit)

I have not explored the funding of other repositories but OSF is secured by funding for 50+ years. The chance it will disappear is very small.

Mendeley

Mendeley is known as a digital library app with great reference tools. Recently, it also launched the Mendeley Data service. I found out about this Mendeley Data repository while writing this blog post.

It is a simple repository. If you already use Mendeley and you do not want to bother with other options, go ahead and use Mendeley Data.

You can see its pros and cons below. I would only like to emphasize that there is a moderation step to publish your data. So, be ready to wait some time before your data becomes public.

Pros:

- popular
- simple
- DOI
- supports directory structure
- optimized for downloading all files at once

Cons:

- no version control system
- moderation
- 10 GB per dataset

Summary
This is not a comprehensive review. I just evaluated these repositories against my own requirements. For example, you may need to check the funding of free repositories to make sure they won’t disappear soon. I also did not pay attention to the license types these repositories support because I usually release my data into the public domain anyway.

If you think there is something crucial I missed, please [let me know](mailto:dmytro.kryvokhyzha@evobio.eu) and I will add it.

Written on August 27, 2019

MiguelRodo commented 1 year ago

I've started using OSF for ACSAntibody. It seems to work well. It is a real advantage that it supports a directory structure, as well as project sub-components. This fits much better with my view that projects should be broken up into multiple sub-projects.

MiguelRodo commented 1 year ago

Note that POPI implies that any data on the cloud are de-identified already (source):

MiguelRodo commented 1 year ago

Basically done.

SATVILab / projr