barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
149 stars 21 forks source link

Detact mutations in mixtures of strains #276

Closed termithorbor closed 1 month ago

termithorbor commented 3 years ago

Hi,

we have set up an experiment with a mixture of three strains and we want to check if these strains evolve over time. We think about using a mapper like bwa or minimap2 before running breseq to separate the evolved strains. Any thoughts on that or is there probably a tool available which is similar to breseq and able to deal with mixtures?

Thanks in advance :)

jeffreybarrick commented 3 years ago

How close in sequence are the three strains? (Are they different species or very similar?) If they are sufficiently distinct that you will be able to call mutations, then there shouldn't be a need to run a mapper to separate the reads before using breseq. Just provide the reference option (-r) three times, so breseq knows about all three references and it will map to all of them and call mutations in all of them. It seems like you should also use the -p polymorphism option, since you expect each of the three strains to be a mixed population (rather than clonal). This should work fine as long

termithorbor commented 3 years ago

We have two Streptococcus thermophilus (ST) strains and one Lactobacillus delbrueckii subsp. bulgaricus (LB) strain. However, we already know, that by the end of the experiment one ST is 98 %, the other one less than 1 % and the LB about 1 % of the sample. Is that a problem? I fear that also a mapper might not really help here.

Strainseeker results:

98.60991% KNOWN ST1 0.84138% KNOWN LB 0.54871% KNOWN ST2

jeffreybarrick commented 3 years ago

I would still try running breseq with all three references at the same time.

It's probably going to be hard to call mutations in the ST2 strain (no matter what your approach) given its low frequency and what I expect is high sequence similarity to ST1 – at least in regions of the genome in which they are similar. At least if you map to both at the same time, you should be able to get rid of false predictions from ST2 reads mapping to ST1.

If you have enough read-depth coverage of the LB, then you can probably identify mutations in it.

In general, analyzing mixed populations can get more complicated than expected. I usually recommend picking some clonal isolates and sequencing them to at least provide some high confidence predictions of mutations so that you can know what some true-positives look like in the mixed populations.

termithorbor commented 3 years ago

Thanks for the help. I encountered this error when using all 3 GeneBank annotations at the same time:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!> FATAL ERROR <!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Duplicate seq id found in file 'Streptococcus_thermophilus_ST115.gb'! Features for '1' were already loaded from file 'Lactobacillus_delbrueckii_subsp_bulgaricus_LB202.gb'.

Solution: I labeled LOCUS and ACCESSION in the Genebank files differently (they all had 1) and now it works. (I just changed them to 1, 2 and 3)

However, would it make sense to test different Polymorphism Read Alignment (RA) Evidence Options and where would you start?

Additionally I got this warning:

----------------------------------> WARNING <----------------------------------- Insufficient coverage to call mutations for some reference sequences. Set either the --targeted-sequencing or --contig-reference option if you want mutations called on these reference sequences.

Insufficient coverage seq ids: 1

This is one of the rare strains in the sample. What would you suggest? Is there any other tool which can deal better with strain mixtures?

jeffreybarrick commented 3 years ago

If you try --targeted-sequencing mode, then it will force it to map reads to those references and try to call mutations. You could see what the coverage graph for them looks like in this case. My guess is that it will show you that you just don't have enough reads mapping to the rare strains to figure out whether they have mutations.

termithorbor commented 3 years ago

Thanks again. A quite high amount of reads is mapping to ST386 which was not detected in the sample by Strainseeker. However, I think there is nothing I can do about it due to the high similarity of ST115 and ST386. How would you interpret the results in such cases?

Additionally, when I use --targeted-sequencing, there are still no mutations detected in seq id 1. Do I probably need to use --contig-reference since my references are GenBank files?

image

jeffreybarrick commented 3 years ago

This high % of reads reflects that many reads map equally well to ST115 and ST386. This explains the "% mapped reads column" not agreeing with the "fit mean" column. The latter only looks at the reads that can be uniquely assigned to one or the other... it shows that ST386 is very rare.

Having reads map equally well to both reference sequences will prevent breseq from calling mutations in those shared regions, which seem to be a significant fraction of the genome. You may want to run with just ST115 as the reference genome and see what happens as the % of reads mapping to ST386 and LB202 are so low it should not lead to calling any spurious mutations, such as those that describe the differences between ST115=>ST386.

For all practical purposes, you basically seem to only have enough ST115 data to work with as far as calling mutations.

termithorbor commented 3 years ago

Thanks again. I have another very general question. How many sample replicates would you sequence to be confident with your mutation detection? Are 3 enough when you find the same muation at 100 % in all of them?

jeffreybarrick commented 3 years ago

It depends. The more typical approach is to compare sequencing of different samples rather than doing biological/technical replicates of sequencing the same sample.

One sample with reasonable sequencing coverage (>40x on average) is usually plenty for detecting 100% frequency variants accurately.

When you get into the <20% or especially the <5% frequency range, it's usually better to sequence a variety of other samples that don't have the mutation (like independent lineages or the ancestor), so you can judge what the background error rates are for that particular mutation.