biocaddie / prototype_issues

Used to report and track bioCADDIE prototype issues

how to handle identical datasets uploaded into multiple repositories (possible versions?) #278

Open jennielarkin opened 7 years ago

jennielarkin commented 7 years ago

I found this problem while running a search with a small result set: "hibernation ground squirrel" gave 38 results. By the way -- I loved that this worked!

However, I noticed that there were several instances of a single dataset being uploaded (with identical information) into multiple repositories. This made the count of datasets inaccurate. It also made it more difficult to figure out how many novel data sets there were (versus duplicates).

So it would be good to have some way of identifying identical datasets in independent repositories, and also a way of showing that information (for example, have one be the main listing but indicate that there are additional identical entries, with their identifiers, at other repositories).
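As a rough illustration of the idea, here is a minimal Python sketch that groups search hits by a normalized title and shows one entry as the main listing with the others attached. The record layout (`repository`, `id`, `title`) and the function names are assumptions made for this example, not DataMed's actual schema, and exact title matching is only one possible heuristic.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase the title and collapse punctuation/whitespace so that
    trivial formatting differences do not hide a match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def group_by_title(records):
    """Group records whose normalized titles are identical."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_title(record["title"])].append(record)
    return list(groups.values())


TITLE = ("Gene Expression data of Arctic Ground Squirrel during "
         "the multiple stages of hibernation")

# The three hits from this issue (hypothetical record layout).
records = [
    {"repository": "BioProject", "id": "PRJNA96231", "title": TITLE},
    {"repository": "ArrayExpress", "id": "E-GEOD-5414", "title": TITLE},
    {"repository": "OmicsDI", "id": "E-GEOD-5414", "title": TITLE},
]

groups = group_by_title(records)
print(f"{len(records)} hits, {len(groups)} distinct dataset(s)")
for group in groups:
    # The first record seen is shown as the main listing (arbitrary choice).
    main, *others = group
    also = ", ".join(f"{r['repository']}:{r['id']}" for r in others)
    print(f"{main['repository']}:{main['id']} (also listed as: {also})")
```

With grouping like this, the result count could report distinct datasets rather than raw hits. Which entry becomes the main listing is arbitrary here; a real implementation would need a rule for that.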

The release date differs between these three entries. Are these different versions of the same dataset? (Perfectly valid -- but can you indicate a version history with provenance across repositories?) Why is it in three different repositories spread across 5 years?

I raise this just as a use case, in case it reflects a more general issue.

one dataset example:

  1. https://datamed.org/display-item.php?repository=0008&id=5914e0795152c67771b3a2d5&query=hibernation ground squirrel
  2. https://datamed.org/display-item.php?repository=0006&id=5913bc275152c62a9fc247bd&query=hibernation ground squirrel
  3. https://datamed.org/display-item.php?repository=0044&id=5841d9315152c649505fcab7&query=hibernation ground squirrel

1) Gene Expression data of Arctic Ground Squirrel during the multiple stages of hibernation BioProject
ID: PRJNA96231
Keywords: Transcriptome or Gene expression
Access Type: download
dateReleased: 08-03-2006
Description: Differential gene expression in a wide range of tissues including brown adipose tissue (BAT), liver, heart, hypothalamus, and skeletal muscle in hibernating arctic ground squirrels during multiple stages in torpor-arousal cycles compared to non-hibernating (post-reproductive) animals with illumina beadarray technology. Keywords: Multiple stage comparison Overall design: Arctic Ground Squirrels were sampled at four stages of hibernation: early arousal denoted as EA (1-2 hrs after Tb cross 30°C, n=4), late arousal denoted as LA (7-8 hrs after Tb cross 30°C, n=4), early torpor denoted as ET (10-20% of torpid episode, n=4) and late torpor denoted as LT (80-90% of torpid episode, n=5), where Tb is the body temperature and the length of torpid episode is estimated from the previous torpor bout. Post-reproductive animals denoted as PR (n=7) were used as non-hibernating control. Five tissue types: brown adipose tissue (BAT), liver, heart, hypothalamus, and skeletal muscle were hybridized on two customized 700-gene beadarray platforms: 1A and 2A on 96-sample Illumina ArrayMatrix. The data of a pilot study involving brown adipose tissue (BAT), liver, and skeletal muscle on 16-sample Illumina BeadChip denoted as 16chip are also included in this series.

2) Gene Expression data of Arctic Ground Squirrel during the multiple stages of hibernation ArrayExpress
ID: E-GEOD-5414
dateReleased: 07-01-2010
Description:
Differential gene expression in a wide range of tissues including brown adipose tissue (BAT), liver, heart, hypothalamus, and skeletal muscle in hibernating arctic ground squirrels during multiple stages in torpor-arousal cycles compared to non-hibernating (post-reproductive) animals with illumina beadarray technology. Arctic Ground Squirrels were sampled at four stages of hibernation: early arousal denoted as EA (1-2 hrs after Tb cross 30°C, n=4), late arousal denoted as LA (7-8 hrs after Tb cross 30°C, n=4), early torpor denoted as ET (10-20% of torpid episode, n=4) and late torpor denoted as LT (80-90% of torpid episode, n=5), where Tb is the body temperature and the length of torpid episode is estimated from the previous torpor bout. Post-reproductive animals denoted as PR (n=7) were used as non-hibernating control. Five tissue types: brown adipose tissue (BAT), liver, heart, hypothalamus, and skeletal muscle were hybridized on two customized 700-gene beadarray platforms: 1A and 2A on 96-sample Illumina ArrayMatrix. The data of a pilot study involving brown adipose tissue (BAT), liver, and skeletal muscle on 16-sample Illumina BeadChip denoted as 16chip are also included in this series

3) Gene Expression data of Arctic Ground Squirrel during the multiple stages of hibernation OmicsDI
ID: E-GEOD-5414
Date Released: 10-18-2011
Description: Differential gene expression in a wide range of tissues including brown adipose tissue (BAT), liver, heart, hypothalamus, and skeletal muscle in hibernating arctic ground squirrels during multiple stages in torpor-arousal cycles compared to non-hibernating (post-reproductive) animals with illumina beadarray technology. Arctic Ground Squirrels were sampled at four stages of hibernation: early arousal denoted as EA (1-2 hrs after Tb cross 30°C, n=4), late arousal denoted as LA (7-8 hrs after Tb cross 30°C, n=4), early torpor denoted as ET (10-20% of torpid episode, n=4) and late torpor denoted as LT (80-90% of torpid episode, n=5), where Tb is the body temperature and the length of torpid episode is estimated from the previous torpor bout. Post-reproductive animals denoted as PR (n=7) were used as non-hibernating control. Five tissue types: brown adipose tissue (BAT), liver, heart, hypothalamus, and skeletal muscle were hybridized on two customized 700-gene beadarray platforms: 1A and 2A on 96-sample Illumina ArrayMatrix. The data of a pilot study involving brown adipose tissue (BAT), liver, and skeletal muscle on 16-sample Illumina BeadChip denoted as 16chip are also included in this series.

jmcmurry commented 7 years ago

+1

ianfore commented 7 years ago

This general issue is well known and a high priority to address. It's refreshing to see it revived here though; the questions that occur to Jennie are natural questions that a user would have.

In this case at base there's only one dataset. The duplication is entirely down to exchange of information between different repositories. There might be slight differences in metadata recorded by ArrayExpress, OmicsDI and BioProject - but the differences are not particularly significant.

What's actually surprising is that what I would regard as the base dataset isn't returned by the Datamed search. It doesn't appear to be in Datamed at all. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5414

jmcmurry commented 7 years ago

If the purpose of DataMed is discovery, metadata distinctions between flavors of the record are important but secondary. On a technical level, this kind of problem is not actually very different in other indexing applications. For inspiration, here are two good approaches, from Google and Amazon.

[Screenshots: Google example and Amazon example]

Individual instances are: 1) wrapped with a single main landing page 2) presented as distinct but related 3) tagged with relevant metadata elements

I've spoken before in general terms about the identifier approaches needed to make this go smoothly, but here's how it would play out in this particular case in DataMed. Note that it is still possible to hide or deemphasize the identifiers themselves in the user interface, except within the link addresses, so I've shown it accordingly. The links themselves won't work (yet?); just take note of them. As for what the meta-landing page would be identified as, I'm agnostic. It could be any one of the IDs below, chosen to be the clique leader, or, if there is sufficient reason to do so, you could create your own identifier.

Gene Expression data of Arctic Ground Squirrel during the multiple stages of hibernation

| Distribution type | Date | Distribution |
|---|---|---|
| original source data | 2006 | GEO |
| parent project record | 2006 | BioProject |
| curated source data | 2010 | ArrayExpress |
| find similar omics datasets | 2011 | OmicsDI |
| visualization and exploration of data | 2012 | Gene Expression Atlas |

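To make the clique-leader idea concrete, here is a minimal Python sketch that picks one distribution from the listing above to stand for the meta-landing page. The `Distribution` class, the `LEADER_PRIORITY` ordering, and the tie-break on year are assumptions for illustration only; the thread does not settle on a rule, and a minted identifier would be an equally valid choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Distribution:
    kind: str    # "Distribution type" column
    year: int    # "Date" column
    source: str  # "Distribution" column


# Hypothetical preference order for choosing the clique leader.
LEADER_PRIORITY = [
    "original source data",
    "curated source data",
    "parent project record",
]


def pick_leader(clique):
    """Return the distribution with the best priority, oldest first on ties."""
    def rank(dist):
        if dist.kind in LEADER_PRIORITY:
            priority = LEADER_PRIORITY.index(dist.kind)
        else:
            priority = len(LEADER_PRIORITY)  # unlisted kinds rank last
        return (priority, dist.year)
    return min(clique, key=rank)


clique = [
    Distribution("original source data", 2006, "GEO"),
    Distribution("parent project record", 2006, "BioProject"),
    Distribution("curated source data", 2010, "ArrayExpress"),
    Distribution("find similar omics datasets", 2011, "OmicsDI"),
    Distribution("visualization and exploration of data", 2012,
                 "Gene Expression Atlas"),
]

leader = pick_leader(clique)
members = [d for d in clique if d != leader]
print("Landing page represented by:", leader.source)
for dist in members:
    print(f"  related: {dist.kind} ({dist.year}, {dist.source})")
```

Under this ordering the GEO record leads the clique and the rest are shown as distinct but related entries. If the GEO record were missing from the index, the same rule would fall back to the ArrayExpress record.
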
JingAn2017 commented 7 years ago

@bozyurt
Hi Burak, Could you supply the relationships between identical datasets from different repositories?

Best regards, Jing

jgrethe commented 7 years ago

Hi Jing, You can search for these in the index directly: look for records that have distributions in different repositories (e.g. ArrayExpress and GEO).

accessURL: https://www.ebi.ac.uk/arrayexpress/files/E-GEOD-48937/E-GEOD-48937.raw.1.zip
storedIn: ArrayExpress
qualifier: gzip compressed
format: TXT
accessType: download
authentication: none
authorization: none

accessURL: https://www.ebi.ac.uk/arrayexpress/files/E-GEOD-48937/E-GEOD-48937.processed.1.zip
storedIn: ArrayExpress
qualifier: gzip compressed
format: TXT
accessType: download
authentication: none
authorization: none

accessURL: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48937
**storedIn: Gene Expression Omnibus**
qualifier: not compressed
format: HTML
accessType: landing page
**primary: true**
authentication: none
authorization: none

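For anyone working from a flat listing like the one above, here is a minimal Python sketch that splits it into distribution records and picks out the one flagged `primary: true`. It assumes each record begins with an `accessURL` line and that the `**` markers are only emphasis; the field names are copied from the listing, but the parsing functions are hypothetical and the real index is presumably queried as structured documents rather than text.

```python
LISTING = """\
accessURL: https://www.ebi.ac.uk/arrayexpress/files/E-GEOD-48937/E-GEOD-48937.raw.1.zip
storedIn: ArrayExpress
accessType: download
accessURL: https://www.ebi.ac.uk/arrayexpress/files/E-GEOD-48937/E-GEOD-48937.processed.1.zip
storedIn: ArrayExpress
accessType: download
accessURL: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48937
**storedIn: Gene Expression Omnibus**
accessType: landing page
**primary: true**
"""


def parse_distributions(text):
    """Split 'key: value' lines into records; 'accessURL' starts a new one."""
    records = []
    for raw in text.splitlines():
        line = raw.strip().strip("*").strip()  # drop emphasis markers
        if not line:
            continue
        key, _, value = line.partition(": ")
        if key == "accessURL" or not records:
            records.append({})
        records[-1][key] = value
    return records


def find_primary(records):
    """Return the first record flagged primary, or None if there is none."""
    for record in records:
        if record.get("primary") == "true":
            return record
    return None


distributions = parse_distributions(LISTING)
primary = find_primary(distributions)
secondary = [d for d in distributions if d is not primary]

print("primary:", primary["storedIn"], primary["accessURL"])
for dist in secondary:
    print("secondary:", dist["storedIn"], dist["accessURL"])
```
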
JingAn2017 commented 7 years ago

[Attached screenshots: arrayexpress, bioproject, geo, omicsdi]

Hello Jeff & Burak, On the beta server, I searched for the dataset (E-GEOD-5414) in omicsdi (please see omicsdi.PNG); its primary repository is arrayexpress. Then I searched for the dataset (E-GEOD-5414) in arrayexpress (please see arrayexpress.PNG); its primary repository is geo. But when I searched for the dataset (GSE5414) in geo, there was no data (please see geo.PNG). The bioproject record (PRJNA96231) has no relationships with other repositories.

So the question may be whether to build the relationships in es first (distinguishing primary and secondary datasets); then I can use that data to mark primary and secondary datasets on the UI. Thank you.
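One possible way to derive those relationships, sketched in Python: both examples in this thread pair an ArrayExpress accession `E-GEOD-N` with a GEO series `GSEN` (E-GEOD-5414 with GSE5414, E-GEOD-48937 with GSE48937), so the primary accession can be computed from the secondary one. The sketch assumes that pattern holds in general; the record layout, function names, and the `indexed_ids` set are hypothetical and only mirror what the searches above returned.

```python
import re


def geo_accession_for(accession):
    """Map an ArrayExpress 'E-GEOD-N' accession to GEO 'GSEN', else None."""
    match = re.fullmatch(r"E-GEOD-(\d+)", accession)
    return f"GSE{match.group(1)}" if match else None


def link_to_primary(records, indexed_ids):
    """Attach the derived primary accession and whether it is in the index."""
    links = []
    for record in records:
        primary = geo_accession_for(record["id"])
        links.append({
            "repository": record["repository"],
            "id": record["id"],
            "primary": primary,
            "primary_indexed": primary in indexed_ids if primary else False,
        })
    return links


# Records found on the beta server, per the searches described above.
records = [
    {"repository": "omicsdi", "id": "E-GEOD-5414"},
    {"repository": "arrayexpress", "id": "E-GEOD-5414"},
    {"repository": "bioproject", "id": "PRJNA96231"},
]
indexed_ids = {"E-GEOD-5414", "PRJNA96231"}  # GSE5414 returned no data

links = link_to_primary(records, indexed_ids)
for link in links:
    if link["primary"] is None:
        print(f"{link['repository']}:{link['id']} has no derived primary")
    elif not link["primary_indexed"]:
        print(f"{link['repository']}:{link['id']} -> {link['primary']} "
              "(primary missing from index)")
    else:
        print(f"{link['repository']}:{link['id']} -> {link['primary']}")
```

A check like this would flag both problems seen here: secondary records whose primary is not indexed, and records such as the bioproject one that cannot be linked by accession pattern alone.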

Best regards, Jing