ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Annotated list of SRA libraries with known Coronaviruses #16

Closed mathemage closed 4 years ago

mathemage commented 4 years ago

Create an annotated list of SRA libraries with known Coronaviruses in the samples (i.e. SARS cell culture, SARS-CoV-2 accessions).

Originally posted by @ababaian in https://github.com/ababaian/serratus/issues/5#issue-589616145

mathemage commented 4 years ago

@jefftaylor42 @ababaian

mathemage commented 4 years ago

@jefftaylor42 @ababaian I suggest adding the Bioinformatics tag to this issue.

ababaian commented 4 years ago

@victorlin is working on this currently

victorlin commented 4 years ago

SRAdb seems to be a good source of metadata. Here is a quick example search using the Python API provided by pysradb (more columns are cut off in the screenshot):

[Screenshot, 2020-04-02: pysradb search results]

Some things to consider:

  1. What information are we looking for - just accession IDs or some other features that come with?
  2. Compile a list of relevant search terms (ex. SARS, MERS, Cov-2, ncov, etc)
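As a rough illustration of point 2, the search terms could be kept in one list and joined into a single boolean query. This is only a sketch: `build_sra_query` and `SEARCH_TERMS` are hypothetical names, and it assumes the target search interface accepts quoted terms joined by an uppercase `OR`, as Entrez does.

```python
def build_sra_query(terms):
    """Join free-text search terms into one OR query string.

    Hypothetical helper: blank entries are dropped and each term is quoted
    so that multi-word phrases stay together.
    """
    cleaned = [t.strip() for t in terms if t.strip()]
    return " OR ".join(f'"{t}"' for t in cleaned)


# Terms taken from the list above; extend as more are compiled.
SEARCH_TERMS = ["SARS", "MERS", "Cov-2", "ncov"]

print(build_sra_query(SEARCH_TERMS))
```

Keeping the terms in a plain list makes it easy to review what was and was not searched for.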

victorlin commented 4 years ago

Rather than relying on a potentially sparse list of search terms, an exhaustive search could be achieved by working backwards from an NCBI Taxonomy entry to its SRA samples (e.g. the Taxonomy entry for Betacoronavirus and its respective Entrez query).
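A minimal sketch of what such a taxonomy-based query might look like against the NCBI E-utilities `esearch` endpoint, using only the standard library. The helper names are hypothetical; 694002 is the NCBI Taxonomy ID for Betacoronavirus, and `[Organism:exp]` asks Entrez to include all descendant taxa. Note that `esearch` returns SRA UIDs rather than run accessions, so a second lookup would still be needed.

```python
import json
import urllib.parse
import urllib.request

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def taxonomy_term(taxid):
    """Entrez term matching a taxon and everything below it."""
    return f"txid{taxid}[Organism:exp]"


def esearch_url(taxid, retmax=100):
    """Build the esearch URL for SRA records under the given taxon."""
    params = {
        "db": "sra",
        "term": taxonomy_term(taxid),
        "retmax": retmax,
        "retmode": "json",
    }
    return EUTILS_ESEARCH + "?" + urllib.parse.urlencode(params)


def fetch_sra_ids(taxid, retmax=100):
    """Query NCBI (needs network access); return (total count, list of UIDs)."""
    with urllib.request.urlopen(esearch_url(taxid, retmax)) as resp:
        result = json.load(resp)["esearchresult"]
    return int(result["count"]), result["idlist"]


# Building the URL does not touch the network.
print(esearch_url(694002))
# count, ids = fetch_sra_ids(694002)  # uncomment to run the actual query
```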

ababaian commented 4 years ago

Now that's a fantastic idea!

superbsky commented 4 years ago

@victorlin please also consider programmatic access to ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz via https://github.com/frallain/NCBI_taxonomy_tree

as well as https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/viruses.txt which can be incorporated into SRAdb as:

"(echo .mode tabs; echo .import viruses.txt viruses;) | sqlite3 SRAmetadb.sqlite"

Once imported, it allows some nice tricks and can be combined with the SRAdb queries in the Python code.
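For anyone who prefers to do the import from Python, here is a rough equivalent of the `sqlite3` one-liner above using the standard `sqlite3` and `csv` modules. It is a sketch under assumptions: `import_tsv` is a hypothetical helper, the first row is treated as the header (as `.import` does for a new table), a leading `#` on the header is stripped for convenience, and the demo rows are made up rather than taken from viruses.txt.

```python
import csv
import io
import sqlite3


def import_tsv(conn, table, lines):
    """Load tab-separated lines into `table`; the first row names the columns."""
    reader = csv.reader(lines, delimiter="\t")
    header = [name.lstrip("#").strip() for name in next(reader)]
    cols = ", ".join(f'"{name}"' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
    # Skip malformed rows whose field count does not match the header.
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
    conn.commit()
    return len(rows)


# Made-up demo data; for real use, pass open("viruses.txt", newline="")
# and sqlite3.connect("SRAmetadb.sqlite") instead.
demo = io.StringIO("#Organism/Name\tTaxID\nExample virus A\t1\nExample virus B\t2\n")
conn = sqlite3.connect(":memory:")
print(import_tsv(conn, "viruses", demo))
print(conn.execute('SELECT "Organism/Name" FROM viruses WHERE TaxID = ?', ("2",)).fetchall())
```

All values are stored as text here, so numeric columns would need a cast before numeric comparisons.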

victorlin commented 4 years ago

Here is a Jupyter notebook showing the process of retrieving metadata for all SRA experiments under Coronaviridae.

I used biopython's Entrez API to get SRA accession IDs under the Taxonomy listing, followed by pysradb's SRAdb API to fetch metadata.

@ababaian if this is sufficient, let me know how I should share the data; otherwise please advise if any other information is needed :)

ababaian commented 4 years ago

Hey @victorlin, this looks great. I think using Jupyter notebooks / Rmd files to track experiments and/or analyses is the way to go moving forward : ) This way we can keep track of what work has been done.

Can you copy your script over using the template available in serratus/notebook/200401_template.ipynb? The output table belongs in serratus/data/sra, in the format of an SraRunInfo.csv table. One last thing: please add a link to the notebook file in serratus/data/README.md under sra. You should have push access to the repo now.

victorlin commented 4 years ago

@ababaian ah, if all that's needed is a SraRunInfo.csv of SRA runs under Coronaviridae, that file can be downloaded from the query page directly without any scripting. Wish I had thought of that earlier 😅

I'll make all those edits including the notebook file anyways, since it seems to capture slightly different information than SraRunInfo.csv. Could be useful in the future.
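In case it helps whoever consumes the file later, a small sketch of pulling run accessions out of an SraRunInfo.csv with the standard `csv` module. It assumes the usual `Run` and `LibraryStrategy` columns of that export; `run_accessions` is a hypothetical helper and the demo rows are placeholders, not real runs.

```python
import csv
import io


def run_accessions(lines, strategy=None):
    """Return run accessions, optionally filtered by LibraryStrategy."""
    runs = []
    for row in csv.DictReader(lines):
        run = (row.get("Run") or "").strip()
        if not run:
            continue  # skip rows without an accession
        if strategy is not None and row.get("LibraryStrategy") != strategy:
            continue
        runs.append(run)
    return runs


# Placeholder rows; for real use, pass open("SraRunInfo.csv", newline="").
demo = "Run,LibraryStrategy\nSRR0000001,RNA-Seq\nSRR0000002,WGS\nSRR0000003,RNA-Seq\n"

print(run_accessions(io.StringIO(demo)))
print(run_accessions(io.StringIO(demo), strategy="RNA-Seq"))
```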

ababaian commented 4 years ago

See also https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE147507

ababaian commented 4 years ago

See also #34

These issues are related, but we can consider this 'done' at least at the superficial level until we resolve the WGS issue as well.

rcedgar commented 4 years ago

Question / new issue / related issue: Do we know any examples of the runs we're trying to find, i.e. human RNA-seq with captured CoV? If I have some, then I can train a quick-and-dirty test to discriminate candidate runs from the background. If we don't know of any, then do we know / can we find any runs with viruses in other families? Those could be used as a model, though this would be a lot more work because we'd have to make new pan-genomes. If we're totally flying blind, then I can make something to sort the runs completed so far, most promising candidates first, for manual analysis so that we can monitor until we find some.

ababaian commented 4 years ago

We have a compiled list of libraries which should be CoV positive here

From an older test, these accessions have lots of SARS-CoV-2 reads:

# CoV+ Control Libraries
SRR11454606 4108
SRR11454607 23362
SRR11454608 184454
SRR11454609 67683
SRR11454610 127472
SRR11454611 3597
SRR11454612 6918
SRR11454613 873824
SRR11454614 1894336
SRR11454615 124702
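As a starting point for sorting runs by how promising they look, the list above can be parsed and ranked with a few lines of Python. This is a sketch that assumes the second column is the number of SARS-CoV-2 reads per library; `parse_counts` and `rank_by_reads` are hypothetical names.

```python
CONTROL_LIBRARIES = """\
# CoV+ Control Libraries
SRR11454606 4108
SRR11454607 23362
SRR11454608 184454
SRR11454609 67683
SRR11454610 127472
SRR11454611 3597
SRR11454612 6918
SRR11454613 873824
SRR11454614 1894336
SRR11454615 124702
"""


def parse_counts(text):
    """Parse 'accession count' lines, ignoring blanks and '#' comments."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        accession, count = line.split()
        counts[accession] = int(count)
    return counts


def rank_by_reads(counts):
    """Return (accession, count) pairs, highest count first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


for accession, count in rank_by_reads(parse_counts(CONTROL_LIBRARIES))[:3]:
    print(accession, count)
# SRR11454614 1894336
# SRR11454613 873824
# SRR11454608 184454
```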