Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

Generic Edlib #145

Closed mobinasri closed 4 years ago

mobinasri commented 4 years ago

1) Supporting C: As you suggested, I made C-Wrappers for the main functions and classes in edlibGeneric.h (it is the C++ header file that contains the templated version of edlib). I didn't change the names of the functions and classes, so they have their original names but in order to avoid conflicts, I put the generic version in the namespace of "edlibGeneric". The C-Wrapped functions are having exactly the same appearances as before but this time they are calling the generic edlib under the hood. ( data type as char ). The header file of C-Wrapped functions is in the directory /edlibGeneric/include/edlibChar.h. Their definitions (in which equivalent generic edlib functions are called ) are in the directory /edlibGeneric/src/edlibChar.cpp.  2) Python API: I modified the python API to support generic sequences. I was not familiar with Cython so it took me some time to learn it, especially how they support C++ templates. I copied your .pyx and .pxd files into another directory and modified them there ( /pythonGeneric ). In your Cython code, you mapped the given sequences to char-type indexes. In the modified code, they are mapped to int-type indexes so the alphabet size is not restricted to 256 anymore. Then edlibGeneric is used for alignment (not generic in reality). You can use your test and performance files to check the correctness and its performance. The output and input objects are no longer strings but lists since it enables us to accept lists of any type. Python iterates over string and lists in the same manner, so there is no problem with passing strings as inputs. 3) Additional Equalities: I removed the boolean matrix that you used for applying additional equalities since for large types (like int) it would be extremely space-consuming. To enable the additional equalities for the generic edlib, I used a disjoint-set structure since by that we can map the equivalent symbols to a single symbol (one of the symbols in each equivalency set). I used this data structure in the function, "transformSequences", in which the sequences are mapped to their indexes (equivalent symbols have the same index now). So, before starting the alignment process we transform the sequences in a way that there is no need to keep track of the object of "EqualityDefintion" and pass it to different functions.This feature is also added to the python API.

Martinsos commented 4 years ago

We agreed to do this through many smaller PRs, so I am closing this one!