EvolBioInf / fur

Find Unique genomic Regions

unbias match lengths by skipping terminal mismatch #17

Closed kloetzl closed 6 months ago

kloetzl commented 6 months ago

I believe this is wrong. After a match of, say, ten bases, we know that the eleventh is a mismatch. Hence, if we start the next match from there, we bias ourselves toward a short random match. Instead, we need to skip that base and start the next match one base beyond it. This is how it is implemented in Phylonium:

https://github.com/EvolBioInf/phylonium/blob/master/src/process.cxx#L280-L281

andi: https://github.com/EvolBioInf/andi/blob/master/src/process.c#L195-L196

(didn't actually test this code, lol).
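
To illustrate the bias, here is a minimal sketch in Python. It is not the code from fur, andi, or Phylonium: the function names `longest_match` and `match_lengths` are hypothetical, and a naive substring search stands in for whatever index the real tools use. It only shows how the reported match lengths differ depending on whether the next match starts at the terminating mismatch or one base beyond it.

```python
def longest_match(query: str, start: int, subject: str) -> int:
    """Length of the longest prefix of query[start:] found anywhere in subject.

    Naive search, for illustration only (hypothetical helper).
    """
    length = 0
    while (start + length < len(query)
           and query[start:start + length + 1] in subject):
        length += 1
    return length


def match_lengths(query: str, subject: str, skip_mismatch: bool = True) -> list[int]:
    """Walk along query and collect successive match lengths.

    skip_mismatch=True  : next match starts one base beyond the mismatch.
    skip_mismatch=False : next match starts at the mismatch itself.
    """
    lengths = []
    pos = 0
    while pos < len(query):
        length = longest_match(query, pos, subject)
        lengths.append(length)
        if skip_mismatch:
            # Jump over the match and the base that terminated it.
            pos += length + 1
        else:
            # Advance by at least one base so a zero-length match cannot stall the loop.
            pos += max(length, 1)
    return lengths


# The two sequences differ only at position 4 (T vs. A).
query = "ACGTTACG"
subject = "ACGTAACG"
print(match_lengths(query, subject, skip_mismatch=True))
print(match_lengths(query, subject, skip_mismatch=False))
```

With the skip, the walk reports the two flanks of the single mismatch, lengths 4 and 3. Without it, the second match starts at the known mismatch and finds only short chance hits, lengths 2 and 2.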

haubold commented 6 months ago

You are right, we should skip beyond the mismatch that terminates the match, so I'll merge your pull request. However, in Chunk 23c, we now also have to skip by one more position, lest we get stuck in an infinite loop.

haubold commented 6 months ago

I've now completed the change, thank you very much for catching this.

kloetzl commented 6 months ago

Glad I could help.