Identify linked mutations from mapped reads

spleonard1 commented 1 year ago

I'm butchering breseq's intended use case: I'm identifying gene mutants that arose during high-throughput gene variant synthesis and tracking their abundance over a short, selective time course (< 48 hours). In many cases there are multiple sets of linked mutations, which can clearly be seen in the read mapping evidence.

Is it possible to identify which mutations occur together on a single read? Does breseq keep track of which unique reads support particular mutation calls? Right now I am using some frequency correlations to loosely link mutations, but it would be nice to parse which reads support which mutations to confidently link them.

I have attached a couple of representative pictures. Not a bug, just a discussion / feature request. Thanks!

jeffreybarrick commented 1 year ago

breseq does not track linkage of mutations by read, not even in simple cases where base substitutions are side by side (which is annoying). I can imagine a post-processing step that could go back and do this, at least in simple, clear-cut cases like this.

If someone wanted to add this to breseq, they could pilot the step with a program that parses the output reference.bam file. For RA evidence items that are within one read length of one another, it would look at the read alignment columns they refer to and count how many times the mutations are and are not found in the same read. A new field in the output GD file, such as "haplotype=XXXX", could then be used to group linked mutations.
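A rough sketch of the counting step, in Python, might look like the following. This is not breseq code. It assumes the reads overlapping the two sites have already been pulled out of reference.bam (for example with a BAM library such as pysam) and reduced to hypothetical `(ref_start, sequence)` tuples with 0-based start positions and ungapped alignments. Indels, soft clipping, base qualities, and paired-end mates are ignored. The names `base_at` and `count_linkage` are made up for illustration.

```python
def base_at(read_start, read_seq, ref_pos):
    """Return the read base aligned to ref_pos, or None if the read
    does not cover that position. Assumes an ungapped alignment."""
    offset = ref_pos - read_start
    if 0 <= offset < len(read_seq):
        return read_seq[offset]
    return None


def count_linkage(reads, site_a, site_b):
    """Count how often two base substitutions occur on the same read.

    reads  -- iterable of (ref_start, sequence) tuples (hypothetical format)
    site_a -- (ref_pos, alt_base) for the first mutation
    site_b -- (ref_pos, alt_base) for the second mutation

    Only reads that span both positions are counted.
    """
    pos_a, alt_a = site_a
    pos_b, alt_b = site_b
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}

    for read_start, read_seq in reads:
        base_a = base_at(read_start, read_seq, pos_a)
        base_b = base_at(read_start, read_seq, pos_b)
        if base_a is None or base_b is None:
            continue  # read does not span both sites, so it says nothing about linkage

        has_a = base_a == alt_a
        has_b = base_b == alt_b
        if has_a and has_b:
            counts["both"] += 1
        elif has_a:
            counts["only_a"] += 1
        elif has_b:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


# Toy example: reference ACGTACGTAC, mutations G->T at position 2 and G->A at position 6.
example_reads = [
    (0, "ACTTACATAC"),  # carries both mutations
    (1, "CTTACATA"),    # carries both mutations
    (0, "ACGTACGTAC"),  # carries neither
    (0, "ACTTACGTAC"),  # carries only the first
    (4, "ACATAC"),      # does not span position 2, so it is skipped
]
example_counts = count_linkage(example_reads, (2, "T"), (6, "A"))
```

A pair of mutations where nearly all spanning reads fall into "both" or "neither" is a candidate for sharing a haplotype label; many "only_a" or "only_b" reads suggest the mutations sit on different variants.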

Since this is unlikely to happen in the near future, maybe you could look into haplotype reconstruction programs used for virus genomes (and mixtures of those) to see if any of them can give you this kind of output?

spleonard1 commented 1 year ago

Oooh that’s a good idea re virus haplotyping approaches, thanks!

barricklab / breseq

Identify linked mutations from mapped reads #357