ewels / Labrador

A web based tool to manage and automate the processing of publicly available datasets.
https://www.bioinformatics.babraham.ac.uk/projects/labrador/
GNU General Public License v3.0

Problems with the requests to the SRA database #8

Closed by crauterb 9 years ago

crauterb commented 9 years ago

I am just setting up Labrador and ran into a problem with my first test set. I wanted to set up a project for the data from here, which has only an SRA accession, and I assumed that would be no problem. However, Labrador could not look it up, forming this request. After checking the code and the NCBI documentation, I found that the request was apparently being sent to the wrong database. After changing the database from db=gds to db=sra, it all worked fine for me, forming this request. I used the stable version 0.2 recommended on your site, but checking the master branch I found that the same problem persists there, namely in the file ./ajax/sra_getproject.php. The file ./ajax/sra_getdata.php seems to query the right databases.
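To make the difference between the two requests concrete, here is a minimal Python sketch (Labrador itself is PHP, so this is not its code). The function name `esearch_url` is hypothetical, and the `https` scheme is an assumption; the endpoint and the `db`, `term` and `usehistory` parameters are the ones that appear in the requests quoted in this thread.

```python
from urllib.parse import urlencode

# Host and path as used in this thread; the https scheme is an assumption.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(accession, db="sra"):
    """Build an ESearch URL for an accession in the given Entrez database."""
    params = {"db": db, "term": accession, "usehistory": "y"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# The only difference between the failing and the working request is `db`.
print(esearch_url("SRP024385", db="gds"))
print(esearch_url("SRP024385", db="sra"))
```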

ewels commented 9 years ago

Hi @crauterb,

Thanks for your issue report - and great that you're interested in running Labrador! This is challenging the brain cells a little as I wrote this project a while ago and haven't been running it myself recently. However, if I remember correctly, the db=gds behaviour was intentional. I have a feeling that when searching the sra database, the API returned insufficient data or something. Many SRA projects are also submitted to the GEO database, and searching GEO with db=gds returned better results. However, as you've found, there are some projects that are in one and not the other.

My memory of this is all a bit hazy. There's a comment in the code that supports my guess that it was intentional, though:

// Get the first XML file with GEO ID accessions, using the supplied SRA accession

As an example, using an SRA project which is present in GEO, this is what we get with the two database searches. Firstly, using db=gds:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=SRP024385&usehistory=y
# Second link will expire, but for completeness:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gds&query_key=1&WebEnv=NCID_1_60195812_130.14.18.34_9001_1443001948_2087594037_0MetA0_S_MegaStore_F_1

gives:

<eSummaryResult>
<DocSum>
    <Id>200047766</Id>
    <Item Name="Accession" Type="String">GSE47766</Item>
    <Item Name="GDS" Type="String"></Item>
    <Item Name="title" Type="String">Deep sequencing of the murine Igh repertoire reveals complex regulation of non-random V gene rearrangement frequencies</Item>
    <Item Name="summary" Type="String">A diverse antibody repertoire is formed through the rearrangement of V, D, and J segments at the immunoglobulin heavy chain (Igh) loci. The C57BL/6 murine Igh locus has over 100 functional VH gene segments that can recombine to a rearranged DJH. While the non-random usage of VH genes is well documented, it is not clear what elements determine recombination frequency. To answer this question we conducted deep sequencing of 5’-RACE products of the Igh repertoire in pro-B cells, amplified in an unbiased manner. ChIP-seq results for several histone modifications and RNA polymerase II binding, RNA-seq for sense and antisense non-coding germline transcripts, and proximity to CTCF and Rad21 sites were compared to the usage of individual V genes. Computational analyses assessed the relative importance of these various accessibility elements. These elements divide the Igh locus into four epigenetically and transcriptionally distinct domains, and our computational analyses reveal different regulatory mechanisms for each region. Proximal V genes are relatively devoid of active histone marks and non-coding RNA in general, but having a CTCF site near their RSS is critical, suggesting that position near the base of the chromatin loops is important for rearrangement. In contrast, distal V genes have high levels of histone marks and non-coding RNA, which may compensate for their poorer RSS and for being distant from CTCF sites. Thus, the Igh locus has evolved a complex system for the regulation of V(D)J rearrangement that is different for of each the four domains that comprise this locus.</Item>
    <Item Name="GPL" Type="String">13112</Item>
    <Item Name="GSE" Type="String">47766</Item>
    <Item Name="taxon" Type="String">Mus musculus</Item>
    <Item Name="entryType" Type="String">GSE</Item>
    <Item Name="gdsType" Type="String">Genome binding/occupancy profiling by high throughput sequencing</Item>
    <Item Name="ptechType" Type="String"></Item>
    <Item Name="valType" Type="String"></Item>
    <Item Name="SSInfo" Type="String"></Item>
    <Item Name="subsetInfo" Type="String"></Item>
    <Item Name="PDAT" Type="String">2013/08/05</Item>
    <Item Name="suppFile" Type="String">BED, BW, WIG</Item>
    <Item Name="Samples" Type="List">
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156658</Item>
            <Item Name="Title" Type="String">H3K27me3 ChIP-seq</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156666</Item>
            <Item Name="Title" Type="String">CTCF input control</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156667</Item>
            <Item Name="Title" Type="String">Rad21 ChIP-seq</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156664</Item>
            <Item Name="Title" Type="String">H3K4me2 H3K4me3 input control</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156661</Item>
            <Item Name="Title" Type="String">PolII input control</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156662</Item>
            <Item Name="Title" Type="String">H3K4me2 ChIP-seq</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156659</Item>
            <Item Name="Title" Type="String">H3ac H3K27me3 input control</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156665</Item>
            <Item Name="Title" Type="String">CTCF ChIP-seq</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156668</Item>
            <Item Name="Title" Type="String">Rad21 input control</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156660</Item>
            <Item Name="Title" Type="String">PolII ChIP-seq</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156663</Item>
            <Item Name="Title" Type="String">H3K4me3 ChIP-seq</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156657</Item>
            <Item Name="Title" Type="String">H3ac ChIP-seq</Item>
        </Item>
    </Item>
    <Item Name="Relations" Type="List"></Item>
    <Item Name="ExtRelations" Type="List">
        <Item Name="ExtRelation" Type="Structure">
            <Item Name="RelationType" Type="String">SRA</Item>
            <Item Name="TargetObject" Type="String">SRP024385</Item>
            <Item Name="TargetFTPLink" Type="String">ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP024/SRP024385/</Item>
        </Item>
    </Item>
    <Item Name="n_samples" Type="Integer">12</Item>
    <Item Name="SeriesTitle" Type="String"></Item>
    <Item Name="PlatformTitle" Type="String"></Item>
    <Item Name="PlatformTaxa" Type="String"></Item>
    <Item Name="SamplesTaxa" Type="String"></Item>
    <Item Name="PubMedIds" Type="List">
        <Item Name="int" Type="Integer">23898036</Item>
    </Item>
    <Item Name="Projects" Type="List"></Item>
    <Item Name="FTPLink" Type="String">ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47766/</Item>
    <Item Name="GEO2R" Type="String">no</Item>
</DocSum>
[ .. truncated .. ]
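As an illustration of why the db=gds response is easy to work with, the following Python sketch reads the top-level accession and the sample list from a DocSum. The sample string is abridged from the response above, and the function name `parse_gds_docsum` is hypothetical; this is not how Labrador's PHP code is written.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the db=gds response shown above (two of the twelve samples).
GDS_SAMPLE = """<eSummaryResult>
<DocSum>
    <Id>200047766</Id>
    <Item Name="Accession" Type="String">GSE47766</Item>
    <Item Name="taxon" Type="String">Mus musculus</Item>
    <Item Name="Samples" Type="List">
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156658</Item>
            <Item Name="Title" Type="String">H3K27me3 ChIP-seq</Item>
        </Item>
        <Item Name="Sample" Type="Structure">
            <Item Name="Accession" Type="String">GSM1156667</Item>
            <Item Name="Title" Type="String">Rad21 ChIP-seq</Item>
        </Item>
    </Item>
</DocSum>
</eSummaryResult>"""


def item_text(parent, name):
    """Return the text of the direct child <Item Name="..."> or None."""
    node = parent.find(f"Item[@Name='{name}']")
    return node.text if node is not None else None


def parse_gds_docsum(xml_text):
    """Extract the series accession, taxon and samples from one DocSum."""
    docsum = ET.fromstring(xml_text).find("DocSum")
    samples = {}
    for sample in docsum.findall("Item[@Name='Samples']/Item[@Name='Sample']"):
        samples[item_text(sample, "Accession")] = item_text(sample, "Title")
    return {
        "accession": item_text(docsum, "Accession"),
        "taxon": item_text(docsum, "taxon"),
        "samples": samples,
    }


print(parse_gds_docsum(GDS_SAMPLE))
```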

Compare this with the results from using db=sra:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=sra&term=SRP024385&usehistory=y
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=sra&query_key=1&WebEnv=NCID_1_60168221_130.14.18.34_9001_1443001813_1903774330_0MetA0_S_MegaStore_F_1

gives:

<eSummaryResult>
<DocSum>
    <Id>420530</Id>
    <Item Name="ExpXml" Type="String">&lt;Summary&gt;&lt;Title&gt;GSM1156668: Rad21 input control; Mus musculus; ChIP-Seq&lt;/Title&gt;&lt;Platform instrument_model="Illumina HiSeq 2000"&gt;ILLUMINA&lt;/Platform&gt;&lt;Statistics total_runs="1" total_spots="27071880" total_bases="2707188000" total_size="1558597050" load_done="true" cluster_name="public"/&gt;&lt;/Summary&gt;&lt;Submitter acc="SRA089211" center_name="GEO" contact_name="Gene Expression Omnibus (GEO), NCBI, NLM, NIH, htt" lab_name=""/&gt;&lt;Experiment acc="SRX298239" ver="1" status="public" name="GSM1156668: Rad21 input control; Mus musculus; ChIP-Seq"/&gt;&lt;Study acc="SRP024385" name="Deep sequencing of the murine Igh repertoire reveals complex regulation of non-random V gene rearrangement frequencies"/&gt;&lt;Organism taxid="10090" ScientificName="Mus musculus"/&gt;&lt;Sample acc="SRS438342" name=""/&gt;&lt;Instrument ILLUMINA="Illumina HiSeq 2000"/&gt;&lt;Library_descriptor&gt;&lt;LIBRARY_STRATEGY&gt;ChIP-Seq&lt;/LIBRARY_STRATEGY&gt;&lt;LIBRARY_SOURCE&gt;GENOMIC&lt;/LIBRARY_SOURCE&gt;&lt;LIBRARY_SELECTION&gt;ChIP&lt;/LIBRARY_SELECTION&gt;&lt;LIBRARY_LAYOUT&gt; &lt;SINGLE/&gt; &lt;/LIBRARY_LAYOUT&gt;&lt;LIBRARY_CONSTRUCTION_PROTOCOL&gt;B6 Rag1−/− pro-B cells were crosslinked with 1% formaldehyde for 10 minutes at room temperature. Subsequently, the lysates were sonicated using a Diagenode Biorupter. The chromatin solution was precleared with salmon sperm DNA-protein A agarose beads. The lysate was immunoprecipitated using the following antibodies; CTCF, H3K4me2, H3ac, and H3K27me3 were purchased from EMD Millipore (Billerica, MA), H3K4me3 from Active Motif (Carlsbad, CA), and Rad21 from Abcam (Cambridge, England). Immune complexes were isolated with protein A agarose beads. Following elution, chromatin-antibody complexes and input DNA were reverse crosslinked by heating at 65°C overnight. The DNA was purified using the Qiagen DNA purification kit. 
The sequencing libraries were prepared using 10 ng of DNA. ChIP DNA sample ends were repaired using the recommended Illumina protocol, including T4 DNA polymerase, Klenow Large Fragment, and T4 polynucleotide kinase. DNA products were purified using the DNA Clean&amp;amp;Concentrator- 5 Kit (Zymo Research). The DNA ends were A-tailed with Klenow Large Fragment (3′→5′ exo-) at 37 °C for 30 min. DNA products were again purified using the DNA Clean&amp;amp;Concentrator- 5 Kit. Next, Illumina Paired End-adapter oligonucleotides (2), at a concentration of 0.33 μM, were ligated to the A-tailed cDNA ends with T4 DNA ligase. DNA products were purified using the DNA Clean&amp;amp;Concentrator-5 Kit. The DNA library products were separated on a 2% (wt/vol) agarose gel, and products corresponding to a size of ∼200–250 bases were removed from the gel and cleaned using the Agencourt SPRI system (Beckman). The DNA material was PCR-amplified with Phusion Polymerase (Finnzymes) with 0.6 μM PCR primers PE 1.0 and PE 2.0 (Illumina) for 15 cycles. The amplified DNA products were further purified on 2% (wt/vol) agarose gel, excised, and isolated again using the ZymocleanGel DNA recovery kit (Zymo Research). The purified DNA library was quantitated using the Qubit quantitation platform (Invitrogen) and sized using the 2100 Bioanalyzer (Agilent). DNA products were then denatured in 0.1 N of NaOH and diluted to a final concentration of 10 pM before being loaded onto the Illumina single read flow cell for 100 base sequencing by synthesis on the Illumina HiSeq2000.&lt;/LIBRARY_CONSTRUCTION_PROTOCOL&gt;&lt;/Library_descriptor&gt;&lt;Bioproject&gt;SRP024385&lt;/Bioproject&gt;</Item>
    <Item Name="Runs" Type="String">&lt;Run acc="SRR891545" total_spots="27071880" total_bases="2707188000" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/&gt;</Item>
    <Item Name="ExtLinks" Type="String"></Item>
    <Item Name="CreateDate" Type="String">2013/08/05</Item>
    <Item Name="UpdateDate" Type="String">2013/09/23</Item>
</DocSum>
[ ... truncated ... ]

Note that this response gives HTML-encoded XML content within an XML tag (:cold_sweat:), and that content is highly variable.
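One way to handle the embedded XML, sketched in Python under some assumptions: the outer parse already decodes the `&lt;`/`&gt;` entities, so the text of `ExpXml` and `Runs` can be parsed a second time. Because each holds several top-level elements, the sketch wraps them in a synthetic `<root>` first. The sample is abridged from the response above, the function names are hypothetical, and real responses may contain fragments that fail the second parse.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the db=sra response shown above.
SRA_SAMPLE = """<eSummaryResult>
<DocSum>
    <Id>420530</Id>
    <Item Name="ExpXml" Type="String">&lt;Summary&gt;&lt;Title&gt;GSM1156668: Rad21 input control; Mus musculus; ChIP-Seq&lt;/Title&gt;&lt;/Summary&gt;&lt;Experiment acc="SRX298239" ver="1" status="public" name="GSM1156668: Rad21 input control; Mus musculus; ChIP-Seq"/&gt;&lt;Study acc="SRP024385" name="Deep sequencing of the murine Igh repertoire reveals complex regulation of non-random V gene rearrangement frequencies"/&gt;</Item>
    <Item Name="Runs" Type="String">&lt;Run acc="SRR891545" total_spots="27071880" total_bases="2707188000" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/&gt;</Item>
</DocSum>
</eSummaryResult>"""


def parse_embedded(fragment_text):
    """Parse a decoded XML fragment by wrapping it in a synthetic root."""
    return ET.fromstring(f"<root>{fragment_text or ''}</root>")


def attr(root, tag, name):
    """Return an attribute of the first matching element, or None if absent."""
    node = root.find(tag)
    return node.get(name) if node is not None else None


def parse_sra_docsum(xml_text):
    """Extract study, experiment, title and run accessions from one DocSum."""
    docsum = ET.fromstring(xml_text).find("DocSum")
    exp = parse_embedded(docsum.findtext("Item[@Name='ExpXml']"))
    runs = parse_embedded(docsum.findtext("Item[@Name='Runs']"))
    return {
        "title": exp.findtext("Summary/Title"),
        "experiment": attr(exp, "Experiment", "acc"),
        "study": attr(exp, "Study", "acc"),
        "runs": [run.get("acc") for run in runs.findall("Run")],
    }


print(parse_sra_docsum(SRA_SAMPLE))
```

Every lookup returns None rather than raising when an element is missing, since the embedded content varies between records.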

So - this explains why the current code is the way that it is. But it doesn't really help you much. Whilst not really a bug in the strictest sense, it would certainly be helpful for the Labrador code to be a bit more flexible when it comes to querying these APIs, and potentially to try a few different approaches to scrape as much information as it can.

Now the downside - I don't really have the time to do this any time in the near future. So, pull requests welcome! Otherwise I'll do my best to get to this soon, but I wouldn't hold my breath. Apologies for this.
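A rough Python sketch of what "try a few different approaches" could look like: query db=gds first, then fall back to db=sra when nothing comes back. Everything here is hypothetical, including the `lookup_with_fallback` name and the `fetch` callback, which stands in for whatever code performs the real request and parsing; the stub below only simulates a project that is missing from GEO.

```python
def lookup_with_fallback(accession, fetch, databases=("gds", "sra")):
    """Try each database in order and return (db, result) for the first hit.

    `fetch(db, accession)` is a caller-supplied function that returns parsed
    metadata, or a falsy value when the database has no matching record.
    """
    for db in databases:
        result = fetch(db, accession)
        if result:
            return db, result
    return None


def fake_fetch(db, accession):
    """Stand-in for a real request: this project only exists in db=sra."""
    records = {("sra", "SRP000000"): {"study": "SRP000000"}}
    return records.get((db, accession))


print(lookup_with_fallback("SRP000000", fake_fetch))
print(lookup_with_fallback("SRP999999", fake_fetch))
```

Keeping gds first preserves the richer GEO metadata where it exists, while projects that are only in SRA still resolve.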

Phil

crauterb commented 9 years ago

Thanks for the quick answer. I'll have a few more goes at it and check it out. As I was thinking about including some other databases as well, I might do a bit of "clearing up" here, since I am getting into the code quite a bit anyway. So: no need to apologize. If I produce something worth noting, or if I just have some more questions, I might get back to you. Guess this is closed then :-)

Cheerio, Christoph

ewels commented 9 years ago

Ok great! Let me know if I can help at all..

Phil