way too many mutations - Githubissues

barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.

http://barricklab.org/breseq

GNU General Public License v2.0

149 stars 21 forks source link

way too many mutations #377

Closed Avylab closed 3 months ago

Avylab commented 3 months ago

Hi, I'm running Breseq to compare two closely related lactobacillus strains and I am getting +10K> mutations. During the run, I get an error saying that the strains seem much different (above 1% difference). It doesn't have a real biologic difference so it left me with a question regarding the run parameters.

I'm using the -r default setting for Illumina reads or the +x ONT setting for ONT samples. my questions are:

Is Breseq optimized for e.coli?
if the answer to 1 is yes, how do I make it compatible with other bacteria?
how do I reduce the mutation prediction and why does it happen ?

thanks for your help and for developing Breseq!

jeffreybarrick commented 3 months ago

breseq predicts mutations relative to the reference genome you provide. It is meant for situations where your reference genome is very, very close to the sample. Ideally the reference is for the starting strain of your experiment.

What I think you are saying about your sequenced genomes is that they should be from strains that are very closely related to one another – maybe differing by only a few mutations?

However, It appears that your reference genome is definitely not closely related to either of your samples!

Because breseq works by aligning reads from your samples to this genome, this means it will report many differences for both (which makes it run slow and produce a ton of output files) and finding the few that are different in these long lists is challenging. Also, it greatly degrades the ability of breseq to find mutations to compare to a genome that is not closely related. It can't find mutations in regions that may be missing in your genome versus the reference because they are not there to align reads to., and reads in regions with high sequence divergence will poorly align.

Here are the two options you have open to you:

1) Find an existing reference genome in NCBI that is actually for your strain - the best solution if it exists! This should get the number of predicted mutations down to just a few that might be differences between your copy of the strain and the stock that was sequenced plus the ones from your experiment.

2) Do novo assemble and annotate your own reference genome. This is a good idea if you want to get things right. With just a bit of nanopore and Illumina data, you can get a closed genome. Then, you can compare your different samples to this sequence using breseq to show the very few mutations – the ones that are biologically meaningful to you.

Let me know if your situation is different or you need advice for de novo assembly.

Avylab commented 3 months ago

so I sent my ancestor for ONT and got a well-assembled genome, I compared it to an evolved strain of the same bacteria (about 500 generations ) and still had the same problem. It's hard for me to believe they diverged so much in such little time. I thought it might be a mixed population but it is not the case. also comparing my raw reads of the ancestor to closely related (based on 16s) to reference genome from NCBI resulted in the same problem. that made me think it's a technical issue and not a biological one.

jeffreybarrick commented 3 months ago

OK. My best guess is that the nanopore reads used for your assembly were not run in the highest accuracy base calling mode and/or your reference genome was not polished with the nanopore reads after assembly (which requires a separate program from the assembler). Initial assemblies of nanopore reads tend to have many small errors (wrong bases and small indels). It's possible polishing will get rid of many/all of these.

The polishing will be even better if you get some Illumina reads for the ancestor. I'd highly recommend doing this! Getting a high-quality reference genome is key to accurately calling mutations in evolved strains.

Good advice on the assembly and polishing process can be found here. See esp. Step 7.

As an additional note, breseq can be used as a polisher. We often do this.

If you take the output/output.gd file for the run of ancestor reads mapped to your ancestor assembly, you can supply this GenomeDiff file to the original reference to generate an updated one using the gdtools utility that is installed alongside breseq. Run this to see the command-line usage.

gdtools APPLY -h

This will generate an updated genome to which you can then compare the reads for your ancestor and evolved sample. Note that you may need to do multiple rounds of polishing (for any polisher program you use) – each time you correct some errors it lets more reads map and correct errors near those errors.

If there are many errors being corrected, you will want to reannotate genes after you do this b/c many indels in reading frames will be corrected. There a section of the breseq manual that discusses annotating reference sequences.

Avylab commented 3 months ago

thank for the ideas! I will look into it . we used Plasmidsaurus NGS hybrid ONT service - so it was polished with Illumina reads for both the ancestor and the evolved bacteria. the contig produced by them also looked good and as far as I can tell gave a good reference genome.

what info would be most useful to troubleshoot it?

thanks!

jeffreybarrick commented 3 months ago

In that case, there must be something biologically different about the DNA sample you sent them for sequencing and assembly versus the DNA samples that you are analyzing with breseq. The data points to them being completely different strains.

Avylab commented 3 months ago

it happened to us in 3 different evolution experiments both between our ancestor and reference and also between evolved and our draft assembly. two of those experiments were conducted on lactobacillus rhamnosus and the 3rd on streptococcus thermophilus. those experiments were done by two different people and resulted in the same problem of very high mutation calling. I'm positive the bacteria we sent are the right ones and not a contamination. we have both ONT reads + polish and Illumina only reads - in both cases, the problem persists. other samples of both e.coli and other non-standard bacteria work fine.

all that made me think it was a technical issue and not a biological one - how would you go about debugging such an issue?

dannagifford commented 3 months ago

FWIW I have seen # of mutations in the 350+ range in a multi-locus mutator strain of E. coli K-12, after a week-long experiment.

But I've also used the wrong ONT+Illumina reference genome as a reference once (labeling error) and also getting 10k+ mutation calls for a clinical E. coli (two different isolates of ST131). When I swapped to the correct genome (also ONT+Illumina reference), the numbers of mutations were much smaller for the length of experiment (1 month in this case).

On Thu, 8 Aug 2024, 15:50 Avylab, @.***> wrote:

it happened to us in 3 different evolution experiments both between our ancestor and reference and also between evolved and our draft assembly. two of those experiments were conducted on lactobacillus rhamnosus and the 3rd on streptococcus thermophilus. those experiments were done by two different people and resulted in the same problem of very high mutation calling. I'm positive the bacteria we sent are the right ones and not a contamination. we have both ONT reads + polish and Illumina only reads - in both cases, the problem persists. other samples of both e.coli and other non-standard bacteria work fine.

all that made me think it was a technical issue and not a biological one - how would you go about debugging such an issue?

— Reply to this email directly, view it on GitHub https://github.com/barricklab/breseq/issues/377#issuecomment-2276542384, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACBVJ3LBEXRX7UR6DUO5MRLZQPDXRAVCNFSM6AAAAABMFELGWOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENZWGU2DEMZYGQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

Avylab commented 3 months ago

Thanks danna , I'm thinking it's really my reference file. I don't why my assembly result in such a wide divergence.

jeffreybarrick commented 3 months ago

I agree with this seems like something is up with your reference assembly. I don't think any assembly that used Illumina reads for polishing could be >1% different at the base pair level, and I wouldn't expect even an assembly of fast-base-called nanopore reads on its own to be that divergent.

Best of luck figuring this out!

I'm going to close this as an issue, but feel free to give updates on what happened.