Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

Sometimes alignments start/end with insertions #32

Closed Martinsos closed 9 years ago

Martinsos commented 9 years ago

With the data at https://drive.google.com/folderview?id=0B3-AVv8sCms8TjB3S0Zfb3RrLTA&usp=sharing and the command src/aligner -a HW -f -c test_data5/read.fasta test_data5/sample_reference.fasta, the alignment starts with insertions. Investigate this!

Martinsos commented 9 years ago

I found the cause of this! When looking for an alignment, I first find the end location, then the start location, and then the alignment itself. When looking for the start location, I was taking the one closest to the end location, and that is what causes the problem. If there is an alignment that starts with insertions, there is always an equally good alignment that starts with mismatches instead (unless the target is too short), and that alignment starts earlier in the target. I fixed this by taking the start location that is as far from the end location as possible. This guarantees that no longer alignment with the same score exists, so the alignment will not start with insertions to the target if it can start with mismatches.
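To illustrate the effect, here is a minimal sketch in Python. It is not edlib's actual code: it uses a plain dynamic-programming edit distance and brute force over start positions, and the names edit_distance and start_candidates, as well as the example sequences, are made up for this illustration. For a fixed end location it lists every start location that reaches the best score, so the difference between taking the closest and the farthest one is visible.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # skip a character of a
                           cur[j - 1] + 1,             # skip a character of b
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


def start_candidates(query, target, end):
    """For a fixed, inclusive end position in target, return the best score
    and all start positions s for which target[s:end+1] reaches that score.
    Brute force, for illustration only."""
    dists = {s: edit_distance(query, target[s:end + 1]) for s in range(end + 2)}
    best = min(dists.values())
    return best, [s for s in sorted(dists) if dists[s] == best]


# Hypothetical example: the query has two leading characters (TT)
# that do not occur anywhere in the target.
query, target, end = "TTACG", "GGGACGGG", 5
best, starts = start_candidates(query, target, end)

closest = max(starts)   # old behaviour: start closest to the end location
farthest = min(starts)  # fixed behaviour: start farthest from the end location

print(best, starts)                    # 2 [1, 2, 3]
print(target[closest:end + 1])         # ACG   -> TT can only be insertions
print(target[farthest:end + 1])        # GGACG -> TT aligns to GG as mismatches
```

All three start locations give the same edit distance of 2. Only the farthest one leaves enough target characters in front of the matching part for the leading query characters to be aligned as mismatches rather than insertions.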

Martinsos commented 9 years ago

Resolved with 89d47ffc4aac7a33fd119073b8c4466c99a56340.