AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Add NCBI GEO surveyor/downloader #161

Closed kurtwheeler closed 6 years ago

kurtwheeler commented 6 years ago

Context

Since we started this project, ArrayExpress has stopped replicating data from NCBI GEO. Therefore, unless we survey and download directly from GEO, we won't get ALL THE DATA. Additionally, some of the datasets ArrayExpress HAS replicated don't have all the metadata or raw data that we want.

Problem or idea

Add another surveyor/downloader combo to use NCBI GEO directly. We don't need a new processor, though, because we aren't going to get any kinds of data that we don't already have processors for.

Solution or next step

To resolve this issue we will need a PR which has:

This issue replaces https://github.com/AlexsLemonade/refinebio/issues/33 and https://github.com/AlexsLemonade/refinebio/issues/114 because it involves the same work, isn't constrained to specific situations, and is better specced out.

jaclyn-taroni commented 6 years ago

A mechanism by which we can ensure that we're using the right surveyor for each accession. I believe this can be inferred from the accession alone because any accession starting with E-GEOD is from GEO.

I agree. They also should have a Comment[SecondaryAccession] field in the idf.txt file and the value will start with GSE.
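A minimal sketch of that inference, assuming accessions arrive as plain strings. The function names are hypothetical (not refinebio's actual identifiers), and the sketch assumes an E-GEOD accession and its GSE counterpart share the same numeric part.

```python
import re
from typing import Optional

_E_GEOD = re.compile(r"^E-GEOD-(\d+)$")
_GSE = re.compile(r"^GSE(\d+)$")


def originates_from_geo(accession: str) -> bool:
    """Return True if the accession alone says the experiment came from GEO."""
    acc = accession.strip().upper()
    return bool(_E_GEOD.match(acc) or _GSE.match(acc))


def to_gse(accession: str) -> Optional[str]:
    """Map an E-GEOD or GSE accession to its GSE form, or None otherwise."""
    acc = accession.strip().upper()
    match = _E_GEOD.match(acc) or _GSE.match(acc)
    if match is None:
        return None
    return "GSE" + match.group(1)
```

The `Comment[SecondaryAccession]` value in the idf.txt file could then serve as a cross-check on the result of `to_gse`, rather than as the primary signal.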

Miserlou commented 6 years ago

GEO (specifically, GEO DataSets, which is what I think we're talking about here) has lots of different types of data. Do we want all of it, or a subset?

This ticket would be greatly assisted by some accession codes/URLs of things that we want from GEO that we can't get from elsewhere.

cgreene commented 6 years ago

I think we actually want GEO Series (the DataSets used to be a curated subset of GEO, but I think they stopped adding new datasets ~2011 or 2012 if I recall correctly).

jaclyn-taroni commented 6 years ago

I agree about the Series. We do not want the curated GEO DataSets.

GEO (specifically, GEO DataSets, which is what I think we're talking about here) has lots of different types of data. Do we want all of it, or a subset?

Is the curated vs. submitter-supplied subset what you're asking about here @Miserlou or are you thinking more in terms of processor (i.e., microarray manufacturer)?

Miserlou commented 6 years ago

I am thinking: you need to tell me what you want so that I can go and get it.

For instance - GEO has these file formats:

What are these? Which of these do we want? What do we do with them?

There are 2,443,553 samples on GEO. The highest "GSM" labelled ones appear to start at GSM444870. Should I just work backwards from there?

What is our strategy for this scrape? What do we want to survey for? How do we want to process the data we collect?

jaclyn-taroni commented 6 years ago

Do you have a link to documentation for the file formats you've listed? That way I can frame my answer in terms of that documentation. My understanding is that SOFT and MiNiML contain the same information but one is plain text and the other is XML and if so, it may be a matter of preference.

Miserlou commented 6 years ago

From here:

[Screenshot: 2018-04-02 at 2:42:45 PM]

Example sample: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS6248 (files to the right, along with a Cluster Analysis chart - do we want to pinch that too?)

[Screenshot: 2018-04-02 at 2:46:00 PM]

I have also seen CEL files for some samples.

I'm guessing the search interface will be the starting point for our queries? There are lots of options, I don't know what we want to filter in/out.

Miserlou commented 6 years ago

On a search like this - some have a "Download data: CEL", and some just have "Download data". If it has CEL files, they are in the "Supplementary data for Series", which look something like GSE20489_RAW.tar.
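For illustration, here is a rough sketch of how a downloader might build the URL for one of those `_RAW.tar` archives. It assumes GEO's FTP site groups series into directories by thousands (for example `GSE20nnn`) and that the archive lives under `suppl/`; the function name is hypothetical, and whether a given series actually has a `_RAW.tar` still has to be checked.

```python
import re

GEO_FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/geo"


def series_raw_tar_url(gse: str) -> str:
    """Build the assumed URL of the supplementary RAW archive for a GEO Series."""
    match = re.fullmatch(r"GSE(\d+)", gse)
    if match is None:
        raise ValueError(f"not a GEO Series accession: {gse!r}")
    digits = match.group(1)
    # Drop the last three digits to get the directory stub: GSE20489 -> GSE20nnn.
    stub = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_FTP_ROOT}/series/{stub}/{gse}/suppl/{gse}_RAW.tar"
```

Rejecting anything that is not a GSE accession up front keeps DataSet (GDS) accessions from silently producing a URL that does not exist.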

cgreene commented 6 years ago

We would prefer CEL data if available, though these are specific to the Affymetrix platforms. For other cases, we might prefer raw data (Illumina BeadArrays, but do people actually upload any raw data?), and this may change as we develop new processors going forward.

Miserlou commented 6 years ago

And what should we do with the ones that don't have CEL files?

ex: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GDS5819 - from https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS5819

jaclyn-taroni commented 6 years ago

We want to work with GEO Series (designated with GSE; see GEO overview), rather than DataSets (the GDS in the example above). On the Series level, this is essentially ArrayExpress with a different API, which means all the ways that we deal with different platforms and raw vs. processed data and sample metadata, etc. in ArrayExpress hold here.
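A small sketch of how a surveyor could tell these accession types apart and skip anything that is not a Series. The prefix-to-kind table covers the four common GEO prefixes; the function names are hypothetical.

```python
import re
from typing import Optional

# GEO accession prefixes and the record type each one denotes.
_KINDS = {"GSE": "series", "GDS": "dataset", "GSM": "sample", "GPL": "platform"}
_ACCESSION = re.compile(r"^(GSE|GDS|GSM|GPL)\d+$")


def geo_accession_kind(accession: str) -> Optional[str]:
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    return _KINDS[match.group(1)]


def is_surveyable(accession: str) -> bool:
    """Only GEO Series (GSE) are in scope; curated DataSets (GDS) are not."""
    return geo_accession_kind(accession) == "series"
```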

As @kurtwheeler said, much of GEO is replicated in ArrayExpress, but that has recently stopped and we've noted anecdotally that sometimes raw data from Agilent or Illumina in GEO doesn't get replicated in ArrayExpress (E-GEOD-68061 vs. GSE68061).

One way I could think of to deal with this: 1) start with ArrayExpress, if there is no raw data on ArrayExpress and it is a GEO accession, check GEO for raw data. 2) also account for the fact that new GEO data will not be in ArrayExpress in some way.

I don't know all the design implications for this, so an alternative approach may be more desirable.
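The two-step idea above can be sketched as a single decision function. This is only an illustration under stated assumptions: the four boolean inputs are presumed to come from earlier survey steps, and the function name and the returned labels are hypothetical.

```python
from typing import Optional


def pick_raw_source(
    ae_has_entry: bool,
    ae_has_raw: bool,
    is_geo_accession: bool,
    geo_has_raw: bool,
) -> Optional[str]:
    """Choose where to fetch raw data from, preferring ArrayExpress."""
    # Step 1: start with ArrayExpress when it has both the entry and raw data.
    if ae_has_entry and ae_has_raw:
        return "ARRAY_EXPRESS"
    # Fallback, which also covers step 2 (new GEO data that has no
    # ArrayExpress entry at all): check GEO for raw data.
    if is_geo_accession and geo_has_raw:
        return "GEO"
    # No raw data found at either source.
    return None
```

Returning `None` leaves the choice of what to do with processed-only experiments to the caller, since that question is still open here.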

Miserlou commented 6 years ago

Gotcha, thank you Jackie!

Miserlou commented 6 years ago

Notes to myself, scenarios to handle. Assuming the existence of a SCAN_TwoColor processor:

Has raw, both at AE and GEO: GSE35186 / E-GEOD-35186 https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-35186/ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35186

Has AE entry, but raw only on GEO: GSE68061/E-GEOD-68061 https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-68061/ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68061

Has NO AE entry, raw available on GEO: XXX: NEEDS EXAMPLE!

Has NO AE entry, only processed available on GEO: XXX: NEEDS EXAMPLE!

Miserlou commented 6 years ago

Has NO AE entry, raw available on GEO (RNA-seq): https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE88061

Miserlou commented 6 years ago

Has NO AE entry, only processed available on GEO (RNA-seq): https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85073

Miserlou commented 6 years ago

Has NO AE entry, raw available on GEO (Microarray [I think]): https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87009

kurtwheeler commented 6 years ago

One thing that I'm unclear about is getting RNAseq data from GEO. In all of our earlier discussions about building out the salmon pipeline, including those involving the Patro lab, we only ever planned on getting RNAseq data from SRA. I had heard mention that we'd probably want to get gene expression data from GEO, but until recently I hadn't heard much about getting RNAseq data from there as well.

Now I'm definitely all about getting as much data as possible, but I'd just like to make sure that we've considered all the possible ramifications of getting data from GEO as well. Have we had any discussions or given any thought to this yet? Browsing through GEO's FAQ, it seems that all RNAseq data is stored in SRA:

For next-generation sequencing, GEO brokers the complete set of raw data files, e.g., FASTQ, to the SRA database on your behalf.

What data types are provided with next-generation sequence submissions? Raw sequence data files: Raw data are loaded to NCBI's Sequence Read Archive (SRA) database. Use the SRA Run Selector to list and select runs to be downloaded or analyzed with the SRA Toolkit.

If this is in fact the case, then finding and downloading data from both SRA and GEO would result in duplicate samples. If however SRA is not a perfect super-set of GEO in terms of RNAseq, then we need to take extra care in ensuring that samples we find through GEO are not in SRA.
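As a rough sketch of the "extra care" case: if each GEO sample can be paired with its linked SRA accession (or `None` when there is no link), the samples that still need to come from GEO are the ones whose link is missing or not yet known to us. The function name and input shapes are hypothetical, and the accessions in the usage example are made-up placeholders.

```python
from typing import Dict, Iterable, List, Optional


def samples_missing_from_sra(
    geo_samples: Dict[str, Optional[str]],
    known_sra: Iterable[str],
) -> List[str]:
    """Return GEO sample accessions that are not already covered via SRA.

    geo_samples maps a GSM accession to its linked SRA accession, or None.
    known_sra holds the SRA accessions that have already been surveyed.
    """
    known = set(known_sra)
    return sorted(
        gsm
        for gsm, sra in geo_samples.items()
        if sra is None or sra not in known
    )


# Placeholder accessions, for illustration only.
example = {"GSM0000001": "SRX0000001", "GSM0000002": None, "GSM0000003": "SRX0000003"}
remaining = samples_missing_from_sra(example, ["SRX0000001"])
```

If SRA really is a superset of GEO's RNAseq data, this list should come back empty for RNAseq series, which would itself be a useful sanity check.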

cgreene commented 6 years ago

NCBI SRA/GEO are run by the same people. I'd be surprised to learn of a case where there is a GEO entry and an SRA entry for raw data but they are not linked. I'm not saying it's impossible, but the system appears to generally be designed to prevent that.

Miserlou commented 6 years ago

Tracking weird ones for myself:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32413 Has RAW, but RAW is only a 6M file called GPL4133_old_annotations.txt. ¯\_(ツ)_/¯

Miserlou commented 6 years ago

Status update!

I'm at the processing step, and I now believe there is a bug in the published version of SCAN.UPC.

With RPy2:

(Pdb) scan_upc('/home/user/data_store/GSE7702/raw/GSM187290.txt')
*** rpy2.rinterface.RRuntimeError: Error in sampleNames(expressionSet) <- sampleNames :
  could not find function "sampleNames<-"

Confirming with system R:

$ R

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> SCAN_TwoColor("GSM1072833")
Error in SCAN_TwoColor("GSM1072833") :
  could not find function "SCAN_TwoColor"
> SCAN.UPC::SCAN_TwoColor("GSM1072833")
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
Downloading GSM1072833 directly from GEO to /tmp/Rtmp0l8oMV.
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1072nnn/GSM1072833/suppl//GSM1072833_252747810638_020510_S01_CGH_105_Dec08_1_2.txt.gz?tool=geoquery'
Content type 'application/x-gzip' length 45609594 bytes (43.5 MB)
==================================================
downloaded 43.5 MB

Normalizing /tmp/Rtmp0l8oMV/GSM1072833_252747810638_020510_S01_CGH_105_Dec08_1_2.txt.gz
Updating duplicate probe names.
Error in sampleNames(expressionSet) = sampleNames :
  could not find function "sampleNames<-"
>

jaclyn-taroni commented 6 years ago

I am looking at this from my phone, so forgive me, but I think it might be because that sample is on an array designed to assess copy number rather than gene expression?

Miserlou commented 6 years ago

Don't think so..

cgreene commented 6 years ago

This one (system R) looks like DNA: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1072833

The one you were doing via rpy looks like RNA though: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7702

Any reason for trying a different sample via the two?

Stephen Piccolo is @srp33 on GitHub if we do end up needing him, but can you try the RNA one via system R too?

Miserlou commented 6 years ago

I tried a bunch with no success.

srp33 commented 6 years ago

I'll take a look as soon as I get a chance. I may have to pull in the limma package. If you could send me a list of a few that do work (if any) and a few that don't work, that would be helpful.

cgreene commented 6 years ago

My impression is that #212 closed this. It's also crossed off on the whiteboard.