CONP-PCNO / conp-portal

:bar_chart: The CONP data portal
https://portal.conp.ca/
MIT License

Dataset size, nb of files, etc. should be optional #558

Open dbujold opened 1 year ago

dbujold commented 1 year ago

My datasets constantly change in size, number of files, and number of participants. The content is also a mix of many file types. I think this type of information should be left optional in the interface.
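For illustration only, here is a minimal sketch of what a DATS-like record could look like if the volatile descriptors were optional. The field names ("distributions", "size", "extraProperties", and the "files"/"subjects" categories) follow common DATS conventions but are assumptions here, not the authoritative CONP schema.

```python
# Hypothetical sketch: a DATS-like record where size, file count, and
# participant count are optional. Field names are assumed for illustration.
from typing import Any, Dict, List, Optional


def build_dats_record(
    title: str,
    description: str,
    licenses: List[Dict[str, Any]],
    size_gb: Optional[float] = None,      # omit for rolling-release datasets
    file_count: Optional[int] = None,     # omit when counts change daily/weekly
    subject_count: Optional[int] = None,  # omit while the cohort is still growing
) -> Dict[str, Any]:
    record: Dict[str, Any] = {
        "title": title,
        "description": description,
        "licenses": licenses,
    }
    extra: List[Dict[str, Any]] = []
    if file_count is not None:
        extra.append({"category": "files", "values": [{"value": str(file_count)}]})
    if subject_count is not None:
        extra.append({"category": "subjects", "values": [{"value": str(subject_count)}]})
    if extra:
        record["extraProperties"] = extra
    if size_gb is not None:
        record["distributions"] = [{"size": size_gb, "unit": {"value": "GB"}}]
    return record
```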

emmetaobrien commented 1 year ago

Longer term, our intent is to deal with the issue of changing datasets through more clearly defined version management, in which any change to any of those factors would be represented as a distinct version of the dataset.

dbujold commented 1 year ago

I understand. So this implies that, for an expanding dataset with frequent (daily/weekly) releases, the DATS document will need to be updated and versioned accordingly?

emmetaobrien commented 1 year ago

That would be the expectation with the current model, yes.
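Under that model, every rolling release would mean recounting the files, recomputing the size, and bumping the DATS version. A hypothetical sketch of such a release step is below; the DATS.json path and field names are assumptions for illustration, not CONP's actual tooling.

```python
# Hypothetical release helper: recompute volatile descriptors and bump the
# DATS "version" field at each rolling release. Paths and field names are
# assumed for illustration only.
import json
from pathlib import Path


def update_dats_for_release(dataset_dir: str, new_version: str) -> None:
    dats_path = Path(dataset_dir) / "DATS.json"
    dats = json.loads(dats_path.read_text())

    files = [p for p in Path(dataset_dir).rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)

    dats["version"] = new_version
    dats["distributions"] = [
        {"size": round(total_bytes / 1e9, 2), "unit": {"value": "GB"}}
    ]
    # Keep the file count in extraProperties, as assumed above.
    extra = [e for e in dats.get("extraProperties", []) if e.get("category") != "files"]
    extra.append({"category": "files", "values": [{"value": str(len(files))}]})
    dats["extraProperties"] = extra

    dats_path.write_text(json.dumps(dats, indent=4))
```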

dbujold commented 1 year ago

I think it would be nice to have a way to support projects with rolling releases as well. Such projects sometimes want to describe their cohort and dataset content in a standardized way, without going into the specifics of how many files there are, what size they are, etc.

emmetaobrien commented 1 year ago

Exactly how much data are you envisioning storing on CONP, and of what sort? Our processing involves building fixed links to every distinct file, so that needs redoing for anything that changes from release to release.

dbujold commented 1 year ago

Right now we have two cohorts of >5000 participants, with thousands of whole genomes, whole exomes, etc. But the data is under controlled access, which means the files wouldn't be indexed by CONP. It's the dataset provenance that we're aiming to describe, rather than its content.

bryancaron commented 1 year ago

Hi David, I was discussing briefly with Emmet this morning. Are the datasets you have in mind those from the BQC19 which we have discussed in the context of distribution through NeuroHub, or different datasets? Thanks!

dbujold commented 1 year ago

Hi Bryan, this one and others. We currently have a few cohorts supported in Bento, often in a rolling-release fashion. We prepare a DATS file to annotate the datasets, but we're not always able to provide precise details about their content.