barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0

Polymorphism score to detect cross-sample contamination? #361

Closed KR0manova closed 8 months ago

KR0manova commented 11 months ago

Hello,

Thank you for making such a great tool. I would greatly appreciate some advice:

I am generating short-read whole-genome data for a large number of bacterial samples. All the samples are the same species and closely related, and cross-contamination between samples is a big concern. I was considering using the Polymorphism Score option to identify such cross-contamination. I currently have a reference genome but will likely be using de novo assembly instead, as some strains are likely to be more divergent. Given that this isn't the actual purpose of the tool, do you have recommendations on the best way to use the Polymorphism Score to get this kind of insight into the read data, or perhaps an alternate way to solve this problem? Thank you kindly!

jeffreybarrick commented 11 months ago

breseq is for resequencing when you have a very close reference genome.

If this is the case, then running your samples all against this genome in polymorphism mode can help you find cross-contamination. You could calculate a similarity score between samples that is based on taking a "dot product" between the vector of frequencies of all mutation predictions in two samples. (Use gdtools ANNOTATE with --format CSV or --format TABLE on all the outputs to make a table that could be used for this.) Making a tree from this similarity score matrix can be useful for detecting mixed samples. We have done this previously on populations from a phage evolution experiment to remove some that are potentially cross-contaminated.
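As a rough illustration of the "dot product" idea, here is a minimal Python sketch. It assumes you have already parsed the annotated tables into one dictionary per sample that maps a mutation identifier to its predicted frequency. The identifiers, sample names, and frequencies below are made up for illustration and are not the gdtools output format. Dividing the dot product by the vector lengths (cosine similarity) is an added assumption here, so that samples with many mutations do not dominate; use the raw `dot` value if you prefer the unscaled version.

```python
from math import sqrt


def frequency_vectors(samples):
    """Align per-sample {mutation_key: frequency} dicts onto one shared key list."""
    keys = sorted({key for freqs in samples.values() for key in freqs})
    vectors = {
        name: [freqs.get(key, 0.0) for key in keys]  # absent mutation = frequency 0
        for name, freqs in samples.items()
    }
    return keys, vectors


def dot(u, v):
    """Plain dot product of two equal-length frequency vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity(u, v):
    """Dot product scaled by both vector lengths; 0.0 if either vector is all zeros."""
    norm = sqrt(dot(u, u)) * sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def similarity_matrix(samples):
    """Return sorted sample names and the pairwise similarity matrix."""
    _, vectors = frequency_vectors(samples)
    names = sorted(vectors)
    matrix = [[similarity(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical example: "A_plus_B" looks like sample A with some of sample B mixed in.
samples = {
    "A": {"SNP:1000:A>G": 1.0, "SNP:2500:C>T": 1.0},
    "B": {"SNP:4000:G>A": 1.0},
    "A_plus_B": {"SNP:1000:A>G": 0.8, "SNP:2500:C>T": 0.8, "SNP:4000:G>A": 0.2},
}

names, matrix = similarity_matrix(samples)
for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
```

In this toy data, the two clean samples share no mutations and score zero against each other, while the mixed sample has a nonzero score against both. A sample that scores well above zero against two otherwise unrelated samples is the pattern to look for before building a tree from the matrix.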

I don't quite understand the comments about a Polymorphism Score. This is a score of the statistical evidence that any one prediction is a polymorphism versus a consensus mutation. So, you have one for every prediction, and its magnitude depends on factors like the read depth in a sample, and not just on the frequency of the predicted polymorphism.

If you mix in de novo assembly, the problem becomes much more complicated, as the same variant will appear at different coordinates in different samples, and de novo assemblies typically have missing regions due to repeats that make mapping reads more error-prone. You can use -c versus -r to help take this into account, but the output will still be messier, unless your strains are very different from the reference you have. In this case, investing in some long-read data to get a closed genome for each one and mapping the reads from your experiment to each one or all of them at once could help you find cases of cross-contamination.

Hope this helps!