Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

Overlap Alignments #81

Closed abiswas-odu closed 7 years ago

abiswas-odu commented 7 years ago

Hello,

We are interested in using Edlib to align short sequences (e.g. ~1 kb long) to nanopore reads (e.g. a few kb long). I think most of our alignments will belong to one of two types: infix alignment and overlap alignment.

In the infix alignment, the short sequences are fully contained in the nanopore reads. Edlib can already handle this with the HW option.

Could you suggest a way to use Edlib to perform overlap alignments? In an overlap alignment, the two sequences have overhangs like this:

===ATCGTC GTTATC===

Can Edlib automatically detect which mode would give the best alignment? Or can Edlib just not penalize the gaps at the beginning and the end of both sequences? This should cover both the infix alignment and the overlap alignment.

Martinsos commented 7 years ago

Hi @abiswas-odu, thank you for your interest in Edlib!

As you said, the HW option should be used for infix alignment.
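For readers unfamiliar with infix alignment, here is a minimal pure-Python sketch of what it computes: the edit distance between the query and its best-matching substring of the target, with gaps before and after the query in the target being free. This is only an illustration of the definition, not Edlib's implementation (Edlib is far faster), and the function name `infix_edit_distance` is made up for this example.

```python
def infix_edit_distance(query, target):
    """Edit distance of `query` against its best-matching substring of `target`.

    Returns (distance, ends), where `ends` lists the exclusive end positions
    in `target` at which a best-matching substring finishes.
    """
    # Row 0 is all zeros: skipping any prefix of the target costs nothing.
    prev = [0] * (len(target) + 1)
    for i, q in enumerate(query, start=1):
        # Column 0 is i: the first i query characters have nothing to align to.
        cur = [i]
        for j, t in enumerate(target, start=1):
            cur.append(min(
                prev[j - 1] + (q != t),  # match or substitution
                prev[j] + 1,             # query character left unaligned
                cur[j - 1] + 1,          # target character left unaligned
            ))
        prev = cur
    # Taking the minimum over the last row makes the target suffix free too.
    best = min(prev)
    ends = [j for j, d in enumerate(prev) if d == best]
    return best, ends


if __name__ == "__main__":
    print(infix_edit_distance("ACGT", "TTTACGTTTT"))
```

With Edlib's Python binding, the equivalent call would be `edlib.align(query, target, mode="HW")`.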

Unfortunately, Edlib does not support overlap alignment, and this follows from its definition: Edlib calculates the edit distance of two sequences, and for edit distance the solution of overlap alignment is trivial and always zero, because an empty overlap costs nothing (explained in more detail here: https://github.com/Martinsos/edlib/issues/54). If you want overlap alignment, you need an aligner that uses a different kind of scoring system, such as the one used by the Smith-Waterman (SW) algorithm. From what you wrote, it sounds to me like SW is what you need: it allows gaps at both the beginning and the end of both sequences.
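To make the point concrete, here is a small pure-Python sketch. The first function computes edit distance with free end gaps on both sequences and always returns zero, because aligning nothing is a valid overlap. The second is a bare-bones Smith-Waterman score, where matches are rewarded and a real overlap therefore wins. Both function names and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are assumptions picked for illustration, not something Edlib or any particular SW library prescribes.

```python
def overlap_edit_distance(a, b):
    """Edit distance with free gaps at the start and end of BOTH sequences."""
    prev = [0] * (len(b) + 1)   # skipping a prefix of b is free
    best = prev[-1]             # empty overlap: all of b before all of a
    for x in a:
        cur = [0]               # skipping a prefix of a is free
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        best = min(best, cur[-1])   # alignment may end anywhere in the last column
        prev = cur
    return min(best, min(prev))     # ... or anywhere in the last row


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with linear gap penalties (score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            score = max(
                0,                                            # start a new alignment
                prev[j - 1] + (match if x == y else mismatch),
                prev[j] + gap,
                cur[j - 1] + gap,
            )
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


if __name__ == "__main__":
    a, b = "GGGATCGTC", "ATCGTCTTT"
    print(overlap_edit_distance(a, b))   # zero for any pair of sequences
    print(smith_waterman_score(a, b))    # rewards the shared ATCGTC
```

Because distance-based scoring has no reward for matched bases, nothing pushes the alignment to be longer than empty; a score-based scheme supplies that incentive.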

Here are a few SW libraries that I can recommend:

There are certainly more implementations out there, each with its own strengths and weaknesses. I recommend researching them a bit and choosing the one that fits you best.