barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
142 stars 21 forks source link

Mixed populations of the same 'species' - two complete reference genomes #314

Closed dannagifford closed 2 years ago

dannagifford commented 2 years ago

Hi all*,

We've done a selection experiment where we started with two different strains of E. coli. We now want to find out which strain won, if the winner gained DNA from the loser and if there are any de novo mutations in the winner. We have ~closed reference genomes (+plasmids) for the ancestors (obtained via hybrid nanopore and Illumina sequencing and annotated with Prokka), and Illumina data for the evolved strains.

As I see it, there's two ways I could do this

  1. Map all reads to each genome separately, i.e. do 2 breseq runs
  2. Give both complete genome reference sequences to breseq at the same time

Since they're both E. coli, I think there will be a lot of reads that map equally well to both ancestors (i.e. the 'core' genome for the 2 strains), and many that will map to only one.

With (1), there would be potential for spurious CNV calls for 'core' genes (so I would run without CNV detection). I also wouldn't get the insertion locations.

Conceptually (2) seems like a better idea, as it could potentially give me the insertion locations of any acquired DNA. However, will I run into issues with reads that map multiply?

Or should I do something else, like try to combine the 2 reference genomes, removing any common genes?

I will probably try both and report back, but would be glad for any thoughts

Danna

*edited to 'all' rather than 'Jeff'

vr1087 commented 2 years ago

Hi Danna. Your question is posed directly to Jeff, but I find your question interesting so I hope you don't mind my two cents.

What you propose will probably not give you the resolution you are hoping for. You should consider what the average nucleotide identity between the two strains is, and how many sequences are unique to each. As I understand it, breseq will not call mutations where sequencing reads map ambiguously, i.e. repeat coverage, multi-mapping, etc. So, under option two, you may be 'blind' to variants in common sequences, but be able to predict variants in unique sequences.

To determine the 'winning strain' you could use a quantitative PCR-based approach with primers targeting discriminating loci in each strain. And to get a higher resolution on the variants that occurred in each strain, I would suggest sequencing single colony isolates from the ending population and calling variants with the appropriate reference. However, predicting horizontal gene transfer events would be a little tricky since breseq can not predict insertions of novel sequences - meaning, the insertion sequence has to be provided as an argument. One idea is to de novo assemble the single colony dataset and do a whole-genome alignment to the ancestrial genome reference to identify insertions and then check those insertions against the genome of the competing strain. However, tools for detecting horizontal gene transfers probably exist. So, I would look to those first before implementing a solution from scratch.

dannagifford commented 2 years ago

Hi @vr1087

Of course--sorry I didn't mean to exclude others' suggestions at all. I'm very grateful for anyone's cents :). Thank you for your comments!

I posted because I think it's an interesting potential use-case for breseq if the multi-mapping issue can be worked around. breseq is (as far as I'm concerned) the premier tool for variant calling from experimental evolution, and I think people who are using breseq are starting to think about HGT in an experimental evolution context. I remember Rohan Maddamsetti (with Richard Lenski, https://doi.org/10.1371/journal.pgen.1007199) doing something similar, and he used an earlier version of breseq for some of the steps.

breseq will not call mutations where sequencing reads map ambiguously

This was my fear, thanks for confirming as I thought I'd read that somewhere in the documentation or source. Is this strictly true with --minimum-mapping-quality set to 0 (which is the default)? 0 is the mapping quality from bowtie2 for a read with multiple best hits.

One idea is to de novo assemble the single colony dataset and do a whole-genome alignment to the ancestrial genome reference to identify insertions and then check those insertions against the genome of the competing strain

We have de novo assemblies for the clones so this is a possibility, thanks for the suggestion.

To determine the 'winning strain' you could use a quantitative PCR-based approach with primers targeting discriminating loci in each strain

This would be a great suggestion for determining frequencies in a mixed population. However, at this stage we have in almost all cases a 'pure' population due to having imposed strong selection (i.e. an evolutionary rescue-type scenario), as suggested by de novo assembly.

jeffreybarrick commented 2 years ago

It's an interesting/tricky experiment to analyze.

Yes, it's true that breseq will not call mutations from reads that map equally well to different locations in the reference sequences even if the minimum MAPQ setting is 0.

I'd try approach 2 (use all references at the same time). If enough reads map uniquely to each reference, then you may be able to estimate the frequencies of the two strains from the relative coverage of their chromosomes. The coverage numbers omit positions that have any ambiguously mapped reads, so there will need to be enough of those to give meaningful coverage numbers. I would probably use the "raw" rather than the "negative binomial fit" coverage values (you can find these in the *.json output file), or at least inspect the coverage graphs to see if the fits look good. It's possible the fits will fail or be off if there are very few positions with only uniquely mapped reads.

I agree that for detecting novel HGT, you probably want clonal data and de novo assembly. QUAST generates some nice output for comparing contigs to reference genomes.

For exchange of DNA between your strains, you can probably use breseq to look for some genome regions (or plasmids) that persist from the losing strain (or have anomalous coverage), if one type is mostly winning.

Another followup note:

*Please don't use the current --cnv option!!! It is so flawed as to not be useful in our experience. We are working on implementing some of the better approaches that are out there for this.

dannagifford commented 2 years ago

Thanks for weighing in Jeff--I have so far tried a few things, including option 2, and at least for the set of sequences I've tried so far everything mapped to only one (suggesting there was no HGT, which is an entirely possible outcome). I think I have to hand over the analysis to my PhD student though. I'll report back if we get it going.