barricklab / breseq

breseq is a computational pipeline for finding mutations relative to a reference sequence in short-read DNA resequencing data. It is intended for haploid microbial genomes (<20 Mb). breseq is a command line tool implemented in C++ and R.
http://barricklab.org/breseq
GNU General Public License v2.0

Homopolymer filtering with nanopore data #387

Open rowi2024 opened 4 hours ago

rowi2024 commented 4 hours ago

Hello, it is really great that breseq can now work on Nanopore data! However, since basecalling accuracy has improved with recent upgrades to Nanopore chemistries and basecalling models, I wonder if you might consider relaxing the filtering of homopolymer regions (allowing longer homopolymers to be queried for mutations). Thanks!!

jeffreybarrick commented 4 hours ago

Yes, this seems reasonable.

If you're running in consensus mode, you should be able to do this with the current breseq version. Add this option in addition to -x to relax the default Nanopore filtering:

--consensus-reject-indel-homopolymer-length 0
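For reference, a full consensus-mode command using this option might look like the sketch below. The file names (reference.gbk, nanopore_reads.fastq) and the output directory (breseq_out) are hypothetical placeholders, not from this thread. It assumes the usual breseq invocation with -r for the reference and -o for the output directory. By default the script only prints the command; set RUN=1 to actually execute it.

```shell
#!/bin/sh
# Hypothetical inputs: replace with your own reference and read files.
REF="reference.gbk"
READS="nanopore_reads.fastq"
OUT="breseq_out"

# Build the command as positional parameters so it can be inspected before running.
# -x selects the Nanopore preset; the homopolymer option is the one suggested above.
set -- breseq -x -r "$REF" -o "$OUT" \
  --consensus-reject-indel-homopolymer-length 0 \
  "$READS"

# Dry run: show the command that would be executed.
echo "$*"

# Execute only when explicitly requested (requires breseq on PATH).
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

Printing the command first makes it easy to confirm that the extra option appears alongside -x before starting a long run.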

However, it looks like you can't currently disable this filtering in polymorphism mode because --polymorphism-no-indels gets set.

We can update that in the next version.

Note to self: Also, it looks like the help text for the -x option is a little broken. We could add some advice there about making sure the basecalling is done in high-quality mode, or about adding back these options.

rowi2024 commented 4 hours ago

Thank you so much! That's great! I will add this and re-run it as soon as my current run of breseq is done :)

jeffreybarrick commented 4 hours ago

Good luck! Let me know if it doesn't work... these sets of options are not well-tested yet.