galaxyproject / idc

Simon's Data Club - Reference data for Galaxy servers
MIT License

curation of the data #4

Open bernt-matthias opened 5 years ago

bernt-matthias commented 5 years ago

Hi there. I'm currently discussing with our cluster admins whether we can integrate the Galaxy data cache CVMFS.

They had a few questions:

  1. Is there a defined curation process for the data, and is there some form of metadata for the data sets? In particular, I'm wondering whether it's possible to determine the source of the data (e.g. whether a genome was downloaded from NCBI/UCSC/..., whether it comes with or without the mitogenome and other contigs, which data manager version was used to create it, ... the download date).

  2. Are there methods implemented to ensure data integrity (e.g. checksums), in particular for data downloaded from public sources?

  3. Is there a versioning system for the data? I have seen that there is one, e.g. for the NCBI taxonomy, but genomes and indices do not seem to be versioned.

And for my own curiosity: the files in this repo have surprisingly little content given the amount of data in http://datacache.galaxyproject.org/managed/.

Thanks...

bgruening commented 5 years ago

> Is there a defined curation process for the data, and is there some form of metadata for the data sets? In particular, I'm wondering whether it's possible to determine the source of the data (e.g. whether a genome was downloaded from NCBI/UCSC/..., whether it comes with or without the mitogenome and other contigs, which data manager version was used to create it, ... the download date).

The idea of this repo is to provide exactly such a transparent and open curation process: a user creates a PR with a data-manager.yml file, and after the merge some CI will run it and deploy the result to the CVMFS mirror.

> Are there methods implemented to ensure data integrity (e.g. checksums), in particular for data downloaded from public sources?

This should be part of the data manager, IMHO, but upstream sources are usually not good at providing such checksums.
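To make the checksum idea concrete, here is a minimal sketch of what a data manager could do after a download, assuming upstream (or an earlier run on our side) has recorded a SHA-256 digest. The function names `sha256_of` and `verify` are hypothetical and not part of any existing data manager.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks
    so that large genome files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a recorded one (hypothetical helper).

    The recorded value is normalised, since published checksums are
    sometimes upper case or carry stray whitespace.
    """
    return sha256_of(path) == expected_hex.strip().lower()
```

If upstream publishes no checksum at all, the digest computed at download time could still be stored, so that later copies on the mirror can be checked against it.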

> Is there a versioning system for the data? I have seen that there is one, e.g. for the NCBI taxonomy, but genomes and indices do not seem to be versioned.

That really depends on the data, I guess, and should be handled by the data manager or by the metadata that you add to this repo.
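As a rough illustration of the kind of metadata that could answer the questions above, here is a sketch of a provenance record covering source, download date, data manager version and checksum. The class name, every field name and all example values are assumptions made for illustration; this repo does not define such a schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical per-dataset metadata; not an existing IDC format."""
    dbkey: str
    source: str                 # e.g. "NCBI" or "UCSC"
    source_url: str
    download_date: str          # ISO 8601, e.g. "2020-01-02"
    data_manager: str
    data_manager_version: str
    includes_mitogenome: bool
    sha256: str


def to_json(record: ProvenanceRecord) -> str:
    """Serialise a record with stable key order so diffs in PRs stay small."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Placeholder values only; none of these refer to real data.
example = ProvenanceRecord(
    dbkey="exampleGenome1",
    source="UCSC",
    source_url="https://example.org/exampleGenome1.fa.gz",
    download_date="2020-01-02",
    data_manager="example_fetch_genome",
    data_manager_version="0.0.0",
    includes_mitogenome=True,
    sha256="0" * 64,
)
```

A file like this could sit next to the data manager configuration in a PR, so that reviewers see the provenance before anything is deployed.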

> And for my own curiosity: the files in this repo have surprisingly little content given the amount of data in http://datacache.galaxyproject.org/managed.

That's one reason why we wanted to start the IDC, but this project is still in its very early stages :(

bwlang commented 5 years ago

Some ideas about a starting point checklist for accepting PRs

bernt-matthias commented 5 years ago

> upstream source does not modify data after public release (reproducibility)

I would also allow data that upstream modifies after release, but then the data needs to be versioned on our side. Good examples would be the NCBI BLAST and taxonomy databases.
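One possible way to version such moving targets on our side is to label each snapshot with its download date plus a short content digest. This is only a sketch: the function `snapshot_version` and the label layout are assumptions, not an agreed convention.

```python
from datetime import date


def snapshot_version(download_date: date, sha256_hex: str, digest_chars: int = 8) -> str:
    """Build a label such as '2020-01-02_abcdef01' (hypothetical scheme).

    The date makes snapshots sortable; the digest prefix distinguishes
    two downloads on the same day whose upstream content differs.
    """
    if len(sha256_hex) < digest_chars:
        raise ValueError("digest is shorter than the requested prefix")
    return f"{download_date.isoformat()}_{sha256_hex[:digest_chars].lower()}"
```

Each snapshot could then get its own directory and data table entry, so that older analyses keep pointing at the exact data they used.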