ComparativeGenomicsToolkit / taffy

This is a C/Python/CLI library for working with TAF (.taf, .taf.gz) and MAF (.maf) alignment files
MIT License

Fix index query bug #52

Closed glennhickey closed 4 months ago

glennhickey commented 4 months ago

Taffy index queries were pretty broken, as @benedictpaten pointed out:

./bin/taffy view -i ./tests/evolverMammals.maf -c  > ./tests/evolverMammals.taf.gz
./bin/taffy index -i ./tests/evolverMammals.taf.gz

./bin/taffy view -i ./tests/evolverMammals.taf.gz -r Anc0.Anc0refChr0:400-405
Assertion failed: (strlen(column) == column_length), function get_bases, file taf.c, line 125.

./bin/taffy view -i ./tests/evolverMammals.taf.gz -r Anc0.Anc0refChr0:410-413
Assertion failed: (*row != NULL), function parse_coordinates_and_establish_block, file taf.c, line 68.

Everything worked fine when indexing and querying the .maf, which is probably why it's taken this long to come up.

Anyway, the problem seems to be that the query function slices a block down to the query region before the next block has been parsed.

The issue is that TAF parsing depends on the previous block. So if a block is sliced before the next one is read, and that slice removes a row (because only gap characters were left after slicing), then the parser aborts when it tries to apply the previous block's coordinates to the next block.

Anyway, the fix is pretty simple: only slice the first block after the next one has been parsed. I added these two problem regions to the tests, and added a test that runs a bunch of TAF queries and makes sure the output is the same as for the MAF.
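
To illustrate the ordering problem and the fix described above, here is a toy Python model. It is not taffy's code: `Row`, `slice_block`, `parse_next`, `query_buggy` and `query_fixed` are hypothetical names, and the "encoding" (the next block carries only bases, with coordinates derived from the previous block's rows) is a deliberate simplification of how TAF relies on the previous block.

```python
from dataclasses import dataclass


@dataclass
class Row:
    name: str    # sequence name
    start: int   # coordinate of the first non-gap base in this block
    bases: str   # aligned bases, '-' marks a gap


def slice_block(rows, lo, hi):
    """Keep alignment columns [lo, hi); drop rows left with only gaps."""
    out = []
    for r in rows:
        kept = r.bases[lo:hi]
        if kept.strip("-") == "":
            continue  # nothing but gaps remains, so the row disappears
        skipped = sum(b != "-" for b in r.bases[:lo])
        out.append(Row(r.name, r.start + skipped, kept))
    return out


def parse_next(prev_rows, encoded):
    """Toy parser: the next block's coordinates come from the previous rows."""
    by_name = {r.name: r for r in prev_rows}
    rows = []
    for name, bases in encoded.items():
        prev = by_name.get(name)
        if prev is None:
            # Stands in for the assertion failure in the real parser.
            raise ValueError(f"no previous row for {name}")
        start = prev.start + sum(b != "-" for b in prev.bases)
        rows.append(Row(name, start, bases))
    return rows


def query_buggy(first, encoded_next, lo, hi):
    # Slice first, then parse the next block against the sliced rows.
    sliced = slice_block(first, lo, hi)
    return sliced, parse_next(sliced, encoded_next)


def query_fixed(first, encoded_next, lo, hi):
    # Parse the next block against the intact rows, then slice.
    nxt = parse_next(first, encoded_next)
    return slice_block(first, lo, hi), nxt


FIRST = [Row("ref", 400, "ACGT"), Row("other", 10, "--GT")]
ENCODED_NEXT = {"ref": "TTAA", "other": "CCGG"}
```

With these inputs, `query_buggy(FIRST, ENCODED_NEXT, 0, 2)` raises `ValueError` because the `other` row is all gaps in columns 0-1 and gets dropped before the next block is parsed, while `query_fixed(FIRST, ENCODED_NEXT, 0, 2)` returns the sliced first block plus a correctly parsed next block.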