Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
506 stars 165 forks source link

Counts in cigar are off when sequences contain characters with diacritics #79

Closed jvdzwaan closed 7 years ago

jvdzwaan commented 7 years ago

I'm using edlib to compare natural language strings (so thank you for not restricting the alphabet!). My goal is to have character by character alignments.

If the strings contain characters with diacritics (e.g., ë) the counts in the cigar are off. This probably has to do with the fact that characters with diacritics are encoded using multiple characters (e.g., ë is c3 ab in utf-8).

Example program:

#include <stdio.h>
#include <stdlib.h>
#include "edlib.h"

int main() {
  EdlibAlignResult result = edlibAlign("aabëaba", 8, "aaëaa", 6, edlibNewAlignConfig(-1, EDLIB_MODE_HW, EDLIB_TASK_PATH));
  char* cigar = edlibAlignmentToCigar(result.alignment, result.alignmentLength, EDLIB_CIGAR_EXTENDED);
  printf("%s\n", cigar);
  free(cigar);
  edlibFreeAlignResult(result);
}

Output: 2=1I3=1I1= (note that if the sequence lengths are set to 7 and 5 respectively, the output is 2=1I3=1I)

Is there a way to take into account characters that are encoded using multiple characters when doing the actual sequence alignment, or will I have to compensate for that when interpreting cigars?

Martinsos commented 7 years ago

Hi @jvdzwaan, glad to see that you use edlib for your project :)!

I get what the problem is. Internally in edlib, I just deal with chars as with numbers. Therefore, unfortunately there is no support currently for letters that take two chars.

I understand that you could tweak the cigar, but I feel it would be better to try and solve the problem before alignment even happens - that way you can be sure everything is working as it should. The idea is: better prevent it then fix it :).

Ok, so here is how I would suggest approaching this. As I said, edlib works with chars as with numbers, there is no preprocessing done or any kind of magic. It basically just compares them to see if they are equal or not. Since chars are 8-bit, that means edlib can work with any alphabet that has up to 256 different letters. So the idea is that if you can map your alphabet to that space, you can use Edlib without any problems. I guess in your case, that would mean mapping those special diacritic characters to some numbers from -127 to 127 (maybe +-1 on boundaries, smt like that :)). I am not sure what are the other characters that you expect to appear in your alphabet, but since it is natural language strings you are most likely not expecting any characters with decimal values from 0 to 31, so you could just map the diacritic characters in that area. Or you could just map them to negative numbers.

Regarding your example, we could define the following mapping for diacritic characters:

    ë -> 1
    ó -> 2
    ú -> 3
 ...

Before using edlib, we would turn "aabëaba" from [97, 97, 98, ?, ?, 97, 98, 97] to [97, 97, 98, 1, 97, 98, 97] and "aaëaa" from [97, 97, ?, ?, 97, 97] to [97, 97, 1, 97, 97] and use those arrays of chars with edlib. Then, you will get results as expected.

I hope this helps! Let me know if something is unclear.

jvdzwaan commented 7 years ago

I hadn't considered that possibility. Excellent suggestion, thanks!

Martinsos commented 7 years ago

No problem! Closing the issue - feel free to create a new one if you encounter the same or other problems.