dahak-metagenomics / dahak

benchmarking and containerization of tools for analysis of complex non-clinical metagenomes.
https://dahak-metagenomics.github.io/dahak
BSD 3-Clause "New" or "Revised" License

Create spike-in dataset with Eukaryotic and viral DNA #48

Open brooksph opened 6 years ago

brooksph commented 6 years ago

Expected behavior

These data were identified by Nicolette (SigSci). Updates from Nicolette: https://docs.google.com/document/d/1uYqs939MU55D_3La8RUc8NJIYVrjLmVD-q_Tw5_Bjbg/edit

Viral DNA spike-ins (non-assembled datasets):

Organism: Human betaherpesvirus 5 (dsDNA)
Technology: Illumina Genome Analyzer II https://www.ncbi.nlm.nih.gov/sra/ERX004415[accn]
Technology: Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/ERX2083171[accn]

Organism: Human gammaherpesvirus 4 (dsDNA) (alternate name: Epstein-Barr virus)
Technology: Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/ERX218636[accn]

Organism: Vaccinia virus (dsDNA)
Technology: Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/SRX2421177[accn]

Organism: Cowpox virus (dsDNA)
Technology: Illumina MiSeq https://www.ncbi.nlm.nih.gov/sra/SRX3106169[accn]

Organism: Torque teno virus (ssDNA)
Technology: Illumina HiSeq 4000 https://www.ncbi.nlm.nih.gov/sra/SRX1762570[accn]

Organism: Adeno-associated virus (ssDNA)
Technology: Illumina HiSeq 2500 https://www.ncbi.nlm.nih.gov/sra/SRX1960902[accn]

Organism: Human bocavirus 1 (ssDNA)
Technology: Illumina HiSeq 2500 https://www.ncbi.nlm.nih.gov/sra/ERX1470610[accn]

Organism: Enterobacteria phage T7
Technology: Illumina HiSeq 2500 https://www.ncbi.nlm.nih.gov/sra/SRX2365806[accn]

Organism: Enterobacteria phage T3
Technology: Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/SRX209596[accn]

Organism: Bacillus phage BC01
Technology: Illumina HiSeq 4000 https://www.ncbi.nlm.nih.gov/sra/SRX3214803[accn]

Eukaryotic microbe spike-ins:

Organism: Saccharomyces cerevisiae Y12
Illumina HiSeq 4000 https://www.ncbi.nlm.nih.gov/sra/SRX2487940[accn]
PacBio RS II https://www.ncbi.nlm.nih.gov/sra/SRX2485790[accn]

Organism: Schizosaccharomyces kambucha strain:SZY13
Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/SRX521792[accn]
PacBio RS https://www.ncbi.nlm.nih.gov/sra/SRX521793[accn]

Organism: Colletotrichum higginsianum IMI 349063
Illumina HiSeq 1500 https://www.ncbi.nlm.nih.gov/sra/SRX2765599[accn]
PacBio RS II https://www.ncbi.nlm.nih.gov/sra/SRX1567884[accn]

Organism: [Candida] auris
Illumina HiSeq 2500 https://www.ncbi.nlm.nih.gov/sra/SRX1939498[accn]
PacBio RS II https://www.ncbi.nlm.nih.gov/sra/SRX1939493[accn]

Organism: Fusarium poae isolate 2516
Illumina HiSeq 2000 https://www.ncbi.nlm.nih.gov/sra/SRX1977327[accn]
PacBio RS II https://www.ncbi.nlm.nih.gov/sra/SRX1977328[accn]
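
To make the list easier to script against, the eukaryotic entries could be kept as a small manifest rather than as free text. The following is only a sketch: the names `Run`, `sra_url`, `is_long_read`, and `hybrid_organisms` are hypothetical helpers and not part of dahak, and treating any technology string that starts with "PacBio" as long-read is an assumption. The accessions and the URL pattern are copied from the list above.

```python
from typing import Dict, List, NamedTuple, Set


class Run(NamedTuple):
    """One SRA experiment from the spike-in list (hypothetical record type)."""
    organism: str
    technology: str
    accession: str


# Eukaryotic spike-in candidates, copied from the list in this issue.
EUKARYOTE_RUNS: List[Run] = [
    Run("Saccharomyces cerevisiae Y12", "Illumina HiSeq 4000", "SRX2487940"),
    Run("Saccharomyces cerevisiae Y12", "PacBio RS II", "SRX2485790"),
    Run("Schizosaccharomyces kambucha strain:SZY13", "Illumina HiSeq 2000", "SRX521792"),
    Run("Schizosaccharomyces kambucha strain:SZY13", "PacBio RS", "SRX521793"),
    Run("Colletotrichum higginsianum IMI 349063", "Illumina HiSeq 1500", "SRX2765599"),
    Run("Colletotrichum higginsianum IMI 349063", "PacBio RS II", "SRX1567884"),
    Run("[Candida] auris", "Illumina HiSeq 2500", "SRX1939498"),
    Run("[Candida] auris", "PacBio RS II", "SRX1939493"),
    Run("Fusarium poae isolate 2516", "Illumina HiSeq 2000", "SRX1977327"),
    Run("Fusarium poae isolate 2516", "PacBio RS II", "SRX1977328"),
]


def sra_url(accession: str) -> str:
    """Rebuild the SRA link in the same form used in this issue."""
    return f"https://www.ncbi.nlm.nih.gov/sra/{accession}[accn]"


def is_long_read(run: Run) -> bool:
    """Assumption: PacBio runs are long-read, everything else is short-read."""
    return run.technology.startswith("PacBio")


def hybrid_organisms(runs: List[Run]) -> List[str]:
    """Return organisms that have both a short-read and a long-read run."""
    kinds: Dict[str, Set[bool]] = {}
    for run in runs:
        kinds.setdefault(run.organism, set()).add(is_long_read(run))
    return sorted(org for org, seen in kinds.items() if seen == {True, False})


if __name__ == "__main__":
    for run in EUKARYOTE_RUNS:
        print(run.organism, run.technology, sra_url(run.accession), sep="\t")
```

With this manifest, `hybrid_organisms(EUKARYOTE_RUNS)` lists the organisms that have both an Illumina and a PacBio run. The viral entries could be added the same way.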

Actual behavior

Steps to reproduce the behavior

brooksph commented 6 years ago

One exclusively short-read dataset and one hybrid dataset.

kternus commented 6 years ago

If it's helpful, the simulated "frankengenome" dataset has a crazy mix of short bacterial, archaeal, viral, and eukaryotic reads: https://ftp-private.ncbi.nlm.nih.gov/nist-immsa/IMMSA/UnAmbiguouslyMapped_ds.frankengenome.fq.gz

I attached a truth file for it that follows the same format as the other unambiguously mapped datasets in the McIntyre et al. 2017 study: UnAmbiguouslyMapped_ds_frankengenome_TRUTH.txt

Column 1 = NCBI Taxonomy ID
Column 2 = Number of reads simulated from that organism
Column 3 = Abundance of that organism in the dataset
Column 4 = Rank
Column 5 = Species name
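
For anyone who wants to load the truth file programmatically, here is a minimal sketch of a parser for the five columns described above. It assumes the file is tab-delimited with no header row, which is not confirmed in this thread. `TruthRecord` and `parse_truth` are hypothetical names, and the example rows are invented to show the layout; they are not taken from the real truth file.

```python
import csv
from typing import Iterable, List, NamedTuple


class TruthRecord(NamedTuple):
    """One row of the truth file, following the column list above."""
    taxid: int        # Column 1: NCBI Taxonomy ID
    read_count: int   # Column 2: number of reads simulated from the organism
    abundance: float  # Column 3: abundance of the organism in the dataset
    rank: str         # Column 4: rank
    species: str      # Column 5: species name


def parse_truth(lines: Iterable[str], delimiter: str = "\t") -> List[TruthRecord]:
    """Parse truth-file lines; the tab delimiter is an assumption."""
    records = []
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 5:
            continue  # skip blank or malformed lines
        records.append(
            TruthRecord(
                taxid=int(row[0]),
                read_count=int(row[1]),
                abundance=float(row[2]),
                rank=row[3].strip(),
                species=row[4].strip(),
            )
        )
    return records


if __name__ == "__main__":
    # Invented example rows, for illustration only.
    example = [
        "562\t300\t0.75\tspecies\tEscherichia coli",
        "4932\t100\t0.25\tspecies\tSaccharomyces cerevisiae",
    ]
    for record in parse_truth(example):
        print(record)
```

To read the attached file, pass an open file handle, for example `parse_truth(open("UnAmbiguouslyMapped_ds_frankengenome_TRUTH.txt"))`. If the real file turns out to be space-delimited or to express abundance as a percentage, the delimiter and the float conversion would need adjusting.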

kternus commented 6 years ago

The frankengenome won't be a good resource for generating spike-in datasets, but it's an additional dataset option if you'd like to do more testing with reads from viruses and eukaryotes.