barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0
152 stars 21 forks source link

Unidentified mutation although present #380

Closed srh-bzd closed 2 months ago

srh-bzd commented 2 months ago

Hello all,

I did an alignment of long-reads (~5000nt) against a reference with breseq version 0.36.1. Here is the command line:

breseq -r "$REFERENCE" -p --polymorphism-frequency-cutoff 0.05 --consensus-frequency-cutoff 0.95 -j "$THREADS" -o "$OUTPUTDIR"/breseq_output/"$file_name" "$READSDIR"/"$file_name".fastq

Basically, all the reads in the FASTQ file are on the forward strand. But, for breseq to work properly, I made sure that 50% of the reads were on the sense strand and 50% on the anti-sense strand.

breseq worked well but no mutations were found (either in "mutation predictions" or "marginal predictions"). However, I know that at a given position in my reference I have polymorphisms in my sample. In fact, when I looked at the BAM alignment created by breseq in IGV I could see that at the position of interest there was polymorphism: 31 reads have the reference nucleotide C (39%), 45 the alternative nucleotide G (56%) and 4 the alternative nucleotide A (5%).

image

I tried to find out for myself why the mutation had not been identified even though it was present by taking a look at the "RA prediction options and flowcharts" section of the breseq documentation (https://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/documentation/methods.html) but I have to admit I didn't understand.

Do you have any idea why breseq didn't identify this mutation?

Thanks for any help

dannagifford commented 2 months ago

A guess, you need to use the nanopore long reads option. breseq was designed for short reads and I think the statistical framework possibly breaks down when the reads are long? The option IIRC simulates short reads from the long reads. Hopefully Jeff will correct me if I'm wrong!

-x, --nanopore

Set read alignment and mutation calling options for Nanopore sequencing data. Important: no indel mutations will be called in homopolymer repeats of 4 or more bases with this option.

If you put all of one allele on sense and all on anti sense I think this will also get rejected by the strand bias check.

srh-bzd commented 2 months ago

Thank you for your help Danna.

I know that the latest version of breseq has an option for long-reads. Unfortunately, I had to use the old one..

"If you put all of one allele on sense and all on anti sense I think this will also get rejected by the strand bias check." Hmm, I thought of that too but in other samples long-reads despite the 50/50 the mutation is well identified.

srh-bzd commented 2 months ago

Well, out of curiosity, I ran breseq again with the same command line, but this time version 0.38.1 and with the --nanopore option. The mutation is still not identified.

jeffreybarrick commented 2 months ago

I think this mutation must be getting rejected by one of the filters.

It's possible that it is one that looks at the quality scores of the bases supporting the variant versus the reference and rejects the variant if it has an unusually bad distribution of quality scores.

If you look at the "Marginal Evidence" tab, you should see the top 20 RA evidence items that didn't pass. Maybe your mutation or a similar one shows there? If so, click on it and there should be one or more yellow rows at the bottom of the table at the top that explain why it is being rejected.

Hopefully, this gives us a clue!

srh-bzd commented 2 months ago

Hello Jeffrey,

Thank you very much for your reply and your help.

Unfortunately, I don't have anything in the "marginal mutations" tab (or the "mutation predictions" tab), regardless of the version of breseq (0.36 / 0.38) and with or without the --nanopore option...

Also, with regard to the quality score of the database, I don't think this is the case as we work with CSS reads from PacBio.

jeffreybarrick commented 2 months ago

Can you email me at the address in the breseq header to set up a way to share an example of your input files, so I can investigate further?

srh-bzd commented 2 months ago

I do this now by indicating the title of the issue in the subject line of the e-mail

jeffreybarrick commented 2 months ago

Here's the filter you are running into!

By default, breseq requires there to be two reads on both strands supporting a variant for it to be predicted in polymorphism mode. Apparently this filter happens so early in the process that it doesn't even get reported as marginal evidence.

Even in your 50-50 dataset, there are only reads on one strand overlapping the region with your mutation of interest. I'm not sure how you made it 50-50, but it doesn't look like it's uniform across the genome.

In any case, you can disable this filer by adding this option to your command line:

--polymorphism-minimum-variant-coverage-each-strand 0

In my output with this option, I see your mutation of interest. Hope this helps!

srh-bzd commented 2 months ago

Hello Dr Barrick,

You're absolutely right, my 50-50 is not happening! So the problem is not with breseq.

In any case, I'm glad to learn about the --polymorphism-minimum-variant-coverage-each-strand 0 option because rather than manipulating the files, I'll apply it in future.

Thank you so much for your help and your time! Incredible