building a contamination database that is small so it downloads and runs quickly

taylorreiter commented 2 years ago

One of the goals of this pipeline is to screen new sequencing data for contamination.

Contamination in sequencing data can come from a lot of different sources:

Contam type 1: contamination from barcode/index hopping. This happens most frequently for low-biomass samples and is an Illumina artifact. Computational identification approach: Screen for sequences of frequently sequenced model organisms. These are the most likely contaminants because they are the most likely things to be sequenced at any given time. If we get mouse in our metagenome or in chlammy RNA-seq (esp. before we have a terrarium), it’s probably from barcode hopping.
Contam type 2: contamination from humans handling the sample. This could be human sequence, or sequence from microbes that live on human skin or in the oral cavity (like S. aureus). Computational identification approach: Include human DNA and human skin/oral microbiome species in the database.
Contam type 3: kit contamination. Kits and reagents have their own microbiome, so DNA from these organisms can sneak into the sample. Computational identification approach: Add the most common kit contaminant organisms to the database. This paper reviews and lists common kit contaminants:

Eisenhofer, R., Minich, J. J., Marotz, C., Cooper, A., Knight, R., & Weyrich, L. S. (2018). Contamination in Low Microbial Biomass Microbiome Studies: Issues and Recommendations. Trends in Microbiology. doi:10.1016/j.tim.2018.11.003
Contam type 4: lab contamination, i.e. accidentally extracting DNA or RNA from other organisms that are in the lab. This would be chlammy for Arcadia, and any other organism that is brought into the lab. Computational identification approach: Select the species Arcadians work with, create signatures for those (masked) genomes, and add them to the database.
Contam type 5: spike-in contamination. Illumina spikes phiX into many of its sequencing runs. Computational identification approach: Create a sourmash signature for phiX and include it in the contamination database.
Contam type 6: contamination in the sample material itself. This might be something like mycorrhizal fungi that sneak into a plant genome sequencing run. Computational identification approach: The best way to identify this type of contamination would be to screen against all known genomic/transcriptomic data, but that doesn't scale well and works against keeping the database small. This pipeline might do a poor job of detecting this type of contamination.
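
To make the screening idea above concrete, here is a toy sketch of k-mer containment: for each candidate contaminant, what fraction of the sample's k-mers also occur in that contaminant's sequence? This is only an illustration of the general idea, not how sourmash or this pipeline implements it (sourmash works on subsampled k-mer sketches rather than full k-mer sets). The function names, the tiny sequences, and `k=4` are all made up for the example.

```python
from typing import Dict, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of k-mers in seq (uppercased, forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def contamination_report(sample: str, references: Dict[str, str], k: int = 4) -> Dict[str, float]:
    """For each reference, return the fraction of sample k-mers found in it."""
    sample_kmers = kmers(sample, k)
    report = {}
    for name, ref_seq in references.items():
        if not sample_kmers:
            # Sample shorter than k: nothing to compare.
            report[name] = 0.0
            continue
        shared = sample_kmers & kmers(ref_seq, k)
        report[name] = len(shared) / len(sample_kmers)
    return report


# Toy data, invented for illustration only.
toy_sample = "ACGTACGTTTGACCA"
toy_references = {
    "toy_spike_in": "ACGTACGT",
    "toy_unrelated": "GGGGGGGGCCCC",
}

report = contamination_report(toy_sample, toy_references, k=4)
for name, fraction in report.items():
    print(f"{name}: {fraction:.2f} of sample k-mers matched")
```

A real screen would use much longer k-mers (so that matches are specific) and would need a threshold for how much overlap counts as contamination; both are choices this sketch leaves open.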

taylorreiter commented 2 years ago

Model organisms (listed from memory, and from searching things like "illumina" on the SRA and seeing what taxonomies have the highest count):

[x] human
[x] mouse
[x] drosophila melanogaster
[x] zebrafish
[x] c. elegans
[x] s. cerevisiae
[x] plasmodium falciparum
[ ] sars cov2
[x] hordeum vulgare
[x] arabidopsis thaliana
[x] triticum aestivum

Arcadia organisms:

[x] TBD

Other:

[x] phix
[x] microbes in Eisenhofer et al. (see above), which probably cover kit contamination and human microbiome contamination -> use a picklist to subset the GTDB database
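
A rough sketch of how the picklist step could look: filter a lineage table down to the genera of interest and write the matching identifiers to a one-column CSV. Everything here is an assumption for illustration: the column names `ident` and `genus`, the `g__` prefix handling, the `TOY_` accessions, and the genus `Examplebacter` are made up, and the real genus list would have to come from the paper itself.

```python
import csv
import io
from typing import Iterable, List, TextIO


def select_idents(lineages: TextIO, genera: Iterable[str]) -> List[str]:
    """Return identifiers of rows whose genus is in the wanted list."""
    wanted = {g.strip().lower() for g in genera}
    picked = []
    for row in csv.DictReader(lineages):
        genus = row["genus"].strip()
        if genus.startswith("g__"):  # assumed GTDB-style rank prefix
            genus = genus[3:]
        if genus.lower() in wanted:
            picked.append(row["ident"])
    return picked


def write_picklist(idents: Iterable[str], out: TextIO) -> None:
    """Write a one-column CSV with an 'ident' header."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ident"])
    for ident in idents:
        writer.writerow([ident])


# Hypothetical lineage table; accessions and species are placeholders.
TOY_LINEAGES = """ident,genus,species
TOY_000001.1,g__Examplebacter,s__Examplebacter toy1
TOY_000002.1,g__Escherichia,s__Escherichia toy2
TOY_000003.1,g__Staphylococcus,s__Staphylococcus toy3
"""

idents = select_idents(io.StringIO(TOY_LINEAGES), ["Examplebacter", "Staphylococcus"])
buffer = io.StringIO()
write_picklist(idents, buffer)
print(buffer.getvalue())
```

If the output were saved as, say, `picklist.csv`, it could then be handed to a sourmash command through its `--picklist picklist.csv:ident:ident` option to pull only those genomes out of the GTDB database. Whether the actual build does it this way is not stated in this thread.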

taylorreiter commented 2 years ago

Done here! https://github.com/Arcadia-Science/seqqc-build-contam-db. Integrated into workflow in #8

Arcadia-Science / seqqc #9