Closed mathemage closed 4 years ago
@jefftaylor42 @ababaian I suggest adding the Bioinformatics tag to this issue.
@victorlin is working on this currently
SRAdb seems to be a good source of metadata. Here is a quick example search using Python API by pysradb (there are more columns cut off):
Some things to consider:
[...] (SARS, MERS, Cov-2, ncov, etc.)
Rather than a potentially sparse list of search terms, an exhaustive search could be achieved by backwards-searching an NCBI Taxonomy entry for SRA samples (e.g. the Taxonomy entry for Betacoronavirus and its respective Entrez query).
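A minimal sketch of what such a taxonomy-based query could look like, using only the Python standard library to build an NCBI E-utilities `esearch` URL against the `sra` database. The helper names are hypothetical, and the Betacoronavirus taxon ID (694002) is an assumption that should be checked against the Taxonomy entry before use. Nothing here touches the network; the URL can be pasted into a browser or passed to any HTTP client.

```python
from urllib.parse import urlencode

# Public E-utilities endpoint for ESearch.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumption: NCBI Taxonomy ID for Betacoronavirus. Verify before relying on it.
BETACORONAVIRUS_TAXID = 694002


def taxonomy_term(taxid: int) -> str:
    """Build an Entrez term matching a taxon and all of its descendants."""
    return f"txid{taxid}[Organism:exp]"


def sra_esearch_url(taxid: int, retmax: int = 100, retstart: int = 0) -> str:
    """Return an ESearch URL listing SRA records under the given taxon."""
    params = {
        "db": "sra",
        "term": taxonomy_term(taxid),
        "retmax": retmax,      # page size
        "retstart": retstart,  # offset of the first record on this page
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = sra_esearch_url(BETACORONAVIRUS_TAXID)
```

Searching by taxon rather than by keyword means new strain names do not have to be added to a term list by hand.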
Now that's a fantastic idea!
@victorlin please also consider programmatic access to ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz via https://github.com/frallain/NCBI_taxonomy_tree
as well as https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/viruses.txt which can be incorporated into SRAdb as:
"(echo .mode tabs; echo .import viruses.txt viruses;) | sqlite3 SRAmetadb.sqlite"
it can do nice tricks and be combined with the SRAdb queries in the Python code
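For anyone who prefers to do the import from Python rather than the `sqlite3` shell, here is a rough equivalent of the `.import` step above. It is only a sketch: `import_tsv` is a hypothetical helper, the column names are taken from whatever header row the file has (with a leading `#` stripped, if present), and rows whose field count does not match the header are skipped.

```python
import csv
import sqlite3


def import_tsv(conn, lines, table):
    """Load tab-separated lines (header first) into a new SQLite table.

    `lines` can be an open file handle or any iterable of strings.
    Returns the number of rows inserted.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Use the header row as column names, dropping a leading '#' if there is one.
    columns = [name.lstrip("#").strip() for name in header]
    quoted = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({quoted})')
    # Keep only rows that line up with the header.
    rows = [row for row in reader if len(row) == len(columns)]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)
```

Usage would look like `with open("viruses.txt", newline="") as fh: import_tsv(sqlite3.connect("SRAmetadb.sqlite"), fh, "viruses")`, after which the new table can be joined against the SRAdb tables in ordinary SQL.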
Here is a Jupyter notebook showing the process of retrieving metadata for all SRA experiments under Coronaviridae.
I used biopython's Entrez API to get SRA accession IDs under the Taxonomy listing, followed by pysradb's SRAdb API to fetch metadata.
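The notebook itself is not reproduced here, so the following is not its code. It is a small standard-library sketch of the first half of that process: pulling the record IDs out of an ESearch XML response and working out the page offsets needed to walk the full result set. The sample XML is made up for illustration. Note that the IDs ESearch returns are numeric UIDs, so a further lookup is needed to turn them into run accessions.

```python
import xml.etree.ElementTree as ET

# Made-up response in the shape ESearch returns (illustration only).
SAMPLE_XML = (
    "<eSearchResult><Count>5</Count><RetMax>2</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>"
)


def parse_esearch_ids(xml_text):
    """Return (total_count, ids_on_this_page) from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext("Count", default="0"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return total, ids


def page_offsets(total, page_size):
    """Offsets (retstart values) needed to page through `total` records."""
    return list(range(0, total, page_size))


total, ids = parse_esearch_ids(SAMPLE_XML)
offsets = page_offsets(total, page_size=2)
```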
@ababaian if this is sufficient, let me know how I should share the data; otherwise please advise if any other information is needed :)
Hey @victorlin this looks great. I think using jupyter notebook / RMD files to track experiments and/or analyses is the way to go moving forward : ) This way we can keep track of what work has been done.
Can you copy over your script using the template available in serratus/notebook/200401_template.ipynb? For the output table, this belongs in serratus/data/sra in the format of an SraRunInfo.csv table. And one last thing would be to add a link to the notebook file to the serratus/data/README.md under sra. You should have push access to the repo now.
@ababaian ah, if all that's needed is a SraRunInfo.csv of SRA runs under Coronaviridae, that file can be downloaded from the query page directly without any scripting. Wish I had thought of that earlier 😅
I'll make all those edits including the notebook file anyways, since it seems to capture slightly different information than SraRunInfo.csv. Could be useful in the future.
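Once an SraRunInfo.csv is in hand, a few lines of Python are enough to get a first look at it. The sketch below groups run accessions by library strategy; it assumes the file has `Run` and `LibraryStrategy` columns, and `runs_by_strategy` is a hypothetical helper name. It also skips repeated header lines, which can appear if several downloads are concatenated.

```python
import csv
from collections import defaultdict


def runs_by_strategy(lines):
    """Group run accessions by LibraryStrategy.

    `lines` can be an open SraRunInfo.csv handle or any iterable of strings.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(lines):
        run = (row.get("Run") or "").strip()
        # Skip empty rows and header lines repeated mid-file.
        if not run or run == "Run":
            continue
        strategy = (row.get("LibraryStrategy") or "unknown").strip()
        grouped[strategy].append(run)
    return dict(grouped)
```

For example, `with open("SraRunInfo.csv", newline="") as fh: runs_by_strategy(fh)` would return a dictionary keyed by strategy.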
See also #34
These issues are related, but we can consider this 'done' at least at the superficial level until we resolve the WGS issue as well.
Question / new issue / related issue: Do we know any examples of the runs we're trying to find, i.e. human RNA-seq with captured Cov? If I have some, then I can train a quick-and-dirty test to discriminate candidate runs from the background. If we don't know of any, then do we know / can we find any runs with viruses in other families? Those could be used as a model, though this would be a lot more work because we'd have to make new pan-genomes. If we're totally flying blind, then I can make something to sort runs completed so far in order of most promising candidates for manual analysis so that we can monitor until we find some.
We have a compiled list of libraries which should be CoV positive here
From an older test these accessions have lots of SARS-CoV-2 reads
# CoV+ Control Libraries
SRR11454606 4108
SRR11454607 23362
SRR11454608 184454
SRR11454609 67683
SRR11454610 127472
SRR11454611 3597
SRR11454612 6918
SRR11454613 873824
SRR11454614 1894336
SRR11454615 124702
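The list above is easy to load for use as a positive control set. This sketch parses it and ranks the libraries by the second column, which is assumed here to be the SARS-CoV-2 read count from the older test; the helper names are hypothetical.

```python
# Control list copied from the thread above.
CONTROL_TEXT = """\
# CoV+ Control Libraries
SRR11454606 4108
SRR11454607 23362
SRR11454608 184454
SRR11454609 67683
SRR11454610 127472
SRR11454611 3597
SRR11454612 6918
SRR11454613 873824
SRR11454614 1894336
SRR11454615 124702
"""


def parse_controls(text):
    """Map each accession to its count, ignoring blank and '#' comment lines."""
    controls = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        accession, count = line.split()
        controls[accession] = int(count)
    return controls


def rank_by_reads(controls):
    """Return (accession, count) pairs, highest count first."""
    return sorted(controls.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_reads(parse_controls(CONTROL_TEXT))
```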
Create an annotated list of SRA libraries with known Coronaviruses in the samples (i.e. SARS cell culture, SARS-CoV-2 accessions).
Originally posted by @ababaian in https://github.com/ababaian/serratus/issues/5#issue-589616145