ComparativeGenomicsToolkit / taffy

This is a C/Python/CLI library for working with TAF (.taf, .taf.gz) and MAF (.maf) alignment files
MIT License

Fix bug in greedy dupe filter #32

Closed glennhickey closed 8 months ago

glennhickey commented 10 months ago

This is a nasty one that came up in https://github.com/ComparativeGenomicsToolkit/cactus/issues/1201

taffy norm -d greedily selects paralogous rows to remove in order to tackle block fragmentation. Selected rows are deleted, and the links between them and their left and right neighbours are cut. But it looks like I left the left_gap_sequence field untouched on the right neighbour. This state (a left_gap_sequence but no left link) apparently left the code that follows in a state where it would mis-assign the length field in the merged block.

This can happen any time taffy add-gap-bases and taffy norm -d are used in conjunction.

The patch here is just to remove the gap sequence along with the link to the previous row. Unfortunately, the bug calls into question the validity of MAFs previously created with cactus-hal2maf --filterGapCausingDupes.
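
To illustrate the idea behind the patch, here is a minimal Python sketch. It is not taffy's actual code (taffy is written in C): the Row class, its left and right link fields, and the unlink_row helper are hypothetical stand-ins, and only the left_gap_sequence name comes from the description above. The sketch assumes the gap sequence holds the unaligned bases between a row and its left neighbour, so it has no meaning once that link is gone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Row:
    """Hypothetical, simplified alignment row (not taffy's real struct)."""
    name: str
    left: Optional["Row"] = None    # same sequence's row in the previous block
    right: Optional["Row"] = None   # same sequence's row in the next block
    left_gap_sequence: Optional[str] = None  # assumed: bases between left row and this row


def unlink_row(row: Row) -> None:
    """Cut a row selected for removal out of its chain of neighbours."""
    if row.left is not None:
        row.left.right = None
    if row.right is not None:
        row.right.left = None
        # The step the patch adds: the right neighbour has lost its left
        # link, so its gap sequence is dropped along with the link.
        row.right.left_gap_sequence = None
    row.left = None
    row.right = None
    row.left_gap_sequence = None


# Build a chain a <-> b <-> c, where c carries gap bases relative to b.
a, b, c = Row("chr1_a"), Row("chr1_b"), Row("chr1_c")
a.right, b.left = b, a
b.right, c.left = c, b
c.left_gap_sequence = "ACGT"

unlink_row(b)  # b plays the role of the paralogous row being filtered out
```

After the call, c has neither a left link nor a gap sequence, so later code never sees the inconsistent "gap but no link" combination.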

It would definitely be good to have a taffy validate that can be run during cactus-hal2maf to prevent future cases like this.
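
As a rough sketch of one check such a validator could perform, the Python below flags rows that carry a gap sequence without a left link, which is the inconsistent state described in this PR. Everything here is an assumption for illustration: the Row class, its field layout, and find_orphan_gap_rows are hypothetical, and no taffy validate command is being described as existing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Row:
    """Hypothetical, simplified alignment row (not taffy's real struct)."""
    name: str
    left: Optional["Row"] = None             # row in the previous block, if linked
    left_gap_sequence: Optional[str] = None  # assumed: bases between left row and this row


def find_orphan_gap_rows(rows: List[Row]) -> List[str]:
    """Return names of rows that have a gap sequence but no left link."""
    return [
        row.name
        for row in rows
        if row.left_gap_sequence is not None and row.left is None
    ]


# One consistent row (gap plus link) and one in the buggy state (gap, no link).
prev = Row("chr1_prev")
good = Row("chr1_good", left=prev, left_gap_sequence="AC")
bad = Row("chr1_bad", left=None, left_gap_sequence="GT")

orphans = find_orphan_gap_rows([prev, good, bad])
```

A real validator would presumably check more than this (coordinates, lengths, strand consistency); the point is only that this invariant is cheap to test on every block.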