biocaddie / prototype_issues

Used to report and track bioCADDIE prototype issues

GEO indexing does not seem to be retrieving relevant datasets #152

Open ianfore opened 7 years ago

ianfore commented 7 years ago

Entered "mouse red blood cell" as the query text and clicked on the "Gene expression" data type facet, which gave 18 results. 17 of those are returned from ArrayExpress but represent studies from GEO. Of those 17, one was also returned as a GEO dataset (GSE24127). Of the remaining 16, 11 were false positives. There are therefore 5 relevant GEO entries that were not retrieved: 1/6 = 16.7% recall.
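For anyone checking the arithmetic, here is a minimal sketch of the recall calculation using the counts reported in this comment. It assumes the ArrayExpress results are the full set of relevant GEO studies for this query, which is an assumption of the sketch and not something verified independently; the variable names are illustrative only.

```python
# Counts taken from the comment above; variable names are illustrative.
arrayexpress_geo_hits = 17   # ArrayExpress results that represent GEO studies
false_positives = 11         # hits judged not relevant to the query
retrieved_via_geo = 1        # GSE24127, also returned as a GEO dataset

# Assumption: the relevant set is every ArrayExpress hit that is not a false positive.
relevant = arrayexpress_geo_hits - false_positives   # 6
missed = relevant - retrieved_via_geo                # 5
recall = retrieved_via_geo / relevant

print(f"relevant={relevant} missed={missed} recall={recall:.1%}")
```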

ianfore commented 7 years ago

Taking one example: E-GEOD-71288 corresponds to GEO accession GSE71288. Searching Datamed with the text "GSE71288" does not return anything, which suggests the record is not even present. Was this record added to GEO after GEO was indexed? That seems an unlikely explanation: GEO shows this record as being public on 8/1/15.

ianfore commented 7 years ago

Another example: E-GEOD-24127 corresponds to GSE24127. Searching Datamed for GSE24127 shows the following minimal metadata for one "series" and four "accessions":

EryP_E85_rep2 GEO Geo accession: GSM594011 Platform: GPL6105 Series: GSE24127

EryP_E105_rep2 GEO Geo accession: GSM594017 Platform: GPL6105 Series: GSE24127

EryP_E85A_rep2 GEO Geo accession: GSM594005 Platform: GPL6105 Series: GSE24127

EryP_E125_rep2 GEO Geo accession: GSM594023 Platform: GPL6105 Series: GSE24127

Following the link for EryP_E85_rep2 does show "Organism: Mus musculus" and "Source Name: primitive erythroid (EryP) cells". Both of these are reasonable synonyms for "mouse" and "red blood cells".

jgrethe commented 7 years ago

This gets to the scope of what bioCADDIE is looking at. From past discussions we had limited the indexing to actual GEO datasets. However, the example above (GSE24127) seems to be from a series not associated with a dataset.

GSE71288 is also a series without a GEO Dataset. So the question is: should we also index the individual series independent of the datasets?

[Image: geo-nodataset]

ianfore commented 7 years ago

We're treating the same thing in different ways depending on whether we encounter it through the GEO ingest or the ArrayExpress ingest. That amounts to (a) differences in what we treat as a dataset and (b) differences in the metadata that is extracted and indexed.

As far as I can see, GSE24127 and GSE71288 each have a dataset:
https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE24127
https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE71288
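To make the "same thing, two ingests" point concrete, here is a rough sketch of how the two accession styles could be reconciled. It assumes every ArrayExpress accession of the form E-GEOD-N maps to GEO series GSEN, which matches the two examples in this thread but is not confirmed here for all records. The function name is hypothetical and not part of any bioCADDIE or DataMed code.

```python
import re
from typing import Optional

# Assumption: ArrayExpress accessions imported from GEO look like "E-GEOD-<digits>".
_AE_GEO_PATTERN = re.compile(r"^E-GEOD-(\d+)$")


def arrayexpress_to_geo(accession: str) -> Optional[str]:
    """Return the GEO series accession for an E-GEOD accession, else None.

    Hypothetical helper for illustration only.
    """
    match = _AE_GEO_PATTERN.match(accession.strip())
    return f"GSE{match.group(1)}" if match else None


# The two pairings mentioned in this thread.
for ae_accession in ("E-GEOD-71288", "E-GEOD-24127"):
    print(ae_accession, "->", arrayexpress_to_geo(ae_accession))
```

A mapping like this would let records from both ingests be compared under one key, so the difference in extracted metadata could be inspected record by record.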

jgrethe commented 7 years ago

The series have data but they do not belong to a GEO Dataset (according to what GEO classifies as a dataset). We can bring in both the datasets and the series.