EvolBioInf / fur

Find Unique genomic Regions

unbias match lengths by skipping terminal mismatch #17

Closed kloetzl closed 10 months ago

kloetzl commented 10 months ago

I believe this is wrong. After a match of, say, ten bases, we know that the eleventh base is a mismatch. Hence, if we start the next match from there, we bias ourselves toward a short random match. Instead, we need to skip that base and start the next match one base beyond it. This is how it is implemented in Phylonium:

https://github.com/EvolBioInf/phylonium/blob/master/src/process.cxx#L280-L281

andi: https://github.com/EvolBioInf/andi/blob/master/src/process.c#L195-L196

(didn't actually test this code, lol).
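To illustrate the bias described above, here is a minimal sketch in Python. It is not the code from fur, Phylonium, or andi: the names `longest_match` and `match_lengths` are hypothetical, and a naive substring search stands in for whatever index those programs actually use. It only compares the two stepping rules: restarting at the mismatching base versus restarting one base beyond it.

```python
def longest_match(query: str, start: int, subject: str) -> int:
    """Length of the longest prefix of query[start:] found anywhere in subject.

    Naive substring search, used here only as a stand-in for a real index.
    """
    length = 0
    while (start + length < len(query)
           and query[start:start + length + 1] in subject):
        length += 1
    return length


def match_lengths(query: str, subject: str, skip_mismatch: bool) -> list[int]:
    """Decompose query into consecutive longest matches against subject.

    skip_mismatch=True  restarts one base beyond the terminating mismatch.
    skip_mismatch=False restarts at the mismatching base itself.
    """
    lengths = []
    i = 0
    while i < len(query):
        n = longest_match(query, i, subject)
        lengths.append(n)
        if skip_mismatch:
            i += n + 1       # always advances, even when n == 0
        else:
            i += max(n, 1)   # guard so a zero-length match cannot stall the loop
    return lengths


subject = "ACGTACGT"
query = "ACGTTCGT"   # differs from subject at position 4 only

print(match_lengths(query, subject, skip_mismatch=False))  # [4, 1, 3]
print(match_lengths(query, subject, skip_mismatch=True))   # [4, 3]
```

Without the skip, the known mismatch at position 4 yields an extra match of length 1, which drags the match-length distribution toward short values. With the skip, only the two flanking matches are recorded.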

haubold commented 10 months ago

You are right, we should skip beyond the mismatch that terminates the match, so I'll merge your pull request. However, in Chunk 23c we now also have to skip by one more position, lest we get stuck in an infinite loop.

haubold commented 10 months ago

I've now completed the change, thank you very much for catching this.

kloetzl commented 10 months ago

Glad I could help.