align two arrays of strings

Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.

http://martinsos.github.io/edlib

MIT License


align two arrays of strings #189

Closed gorliver closed 2 years ago

gorliver commented 2 years ago

Thank you for the fast alignment tool! I wonder whether edlib can align two arrays like this: query=["TCA","CAT","ATG","TGC"] target=["TCA","CAA","AAC","ACC"]

Martinsos commented 2 years ago

Thanks!

No, unfortunately it doesn't support N vs N alignment, although it has been discussed here and might be something to add in the future.

You could use it though to do 1 on 1 alignment N times.

Actually wait, maybe I got you wrong -> do you want to calculate alignments for the Cartesian product of these two arrays, meaning all combinations (16 of them), or do you want to treat the elements of each array as single letters? What does it mean to align those two arrays, and what result would you expect?

gorliver commented 2 years ago

Sorry for the confusion. I actually want to treat the elements of the array as single letters. That is, "TCA" in query matches "TCA" in target, while "CAT" in query mismatches "CAA" in target.

Martinsos commented 2 years ago

Got it! Unfortunately that is not supported right now, although there is an effort to make edlib more generic so it can work with any kind of input: https://github.com/Martinsos/edlib/tree/gen-seqs. Significant progress was made, but the work has been on hold for the last half year or more. I believe it will be picked up again, but I can't say when to expect it to be done.

What is the size of the arrays we are talking about, how many elements?

gorliver commented 2 years ago

Looking forward to this feature :-) the size of the arrays ranges from several to several thousand elements. What I'm trying to do is align the "words" from two sequences rather than every single character. Do you have any recommendations for this application? Thank you!

Martinsos commented 2 years ago

Unfortunately I don't have any specific advice, as I haven't really been in the bioinfo field recently, but once we have that feature it would certainly be a perfect solution!

Btw, there is a trick you can use to make Edlib work for this right now, IF (and that is a big if) your "alphabet" has at most 256 symbols. For example, if you are using words of length 3 and the only possible characters are A, C, T and G, then there are only 4 × 4 × 4 = 64 possible words, so it will work. What you would do is manually assign a number to each possible word in your alphabet and use that mapping to transform your sequences from arrays of strings into arrays of numbers (chars actually, because chars are really just numbers from 0 to 255). And then you can run edlib on that! Is there a chance this could be a solution for you?
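A rough sketch of that trick in Python, in case it helps. The helper names `encode_words` and `word_edit_distance` are made up for this example and are not part of edlib. It assumes the Python binding is installed and that `edlib.align` accepts `bytes` input. Instead of numbering every possible word up front, it only numbers the words that actually occur in the two sequences, which is enough for the comparison.

```python
def encode_words(query_words, target_words):
    """Encode two word sequences as bytes, one byte per word.

    Each distinct word gets its own code in 0..255, so equal words map to
    equal bytes and different words map to different bytes.
    """
    codes = {}
    for word in list(query_words) + list(target_words):
        if word not in codes:
            codes[word] = len(codes)  # next unused code
    if len(codes) > 256:
        raise ValueError(
            f"{len(codes)} distinct words, but one byte can only hold 256"
        )
    query = bytes(codes[w] for w in query_words)
    target = bytes(codes[w] for w in target_words)
    return query, target, codes


def word_edit_distance(query_words, target_words):
    """Word-level edit distance via edlib (assumes `pip install edlib`)."""
    import edlib  # imported lazily so encode_words works without it

    query, target, _ = encode_words(query_words, target_words)
    # Global (NW) alignment; each byte stands for one whole word.
    return edlib.align(query, target, mode="NW", task="distance")["editDistance"]
```

With the arrays from the first comment, `encode_words` turns the query into the bytes 0, 1, 2, 3 and the target into 0, 4, 5, 6, so only the leading "TCA" lines up as a match. The `ValueError` is the "big if" from above: once the two sequences contain more than 256 distinct words between them, this approach no longer applies.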

gorliver commented 2 years ago

In my real work, the words range from six to 20 characters long. In most cases, there are around two thousand words that need to be compared. It seems I have to rely on your next release. It's a nice trick though. I will use it for the comparisons that have fewer than 256 words. Many thanks!

Martinsos commented 2 years ago

Got it, in that case this will not work. Yes, we need to get that out! But it is a really big change and I am not actively pushing it so it is taking time -> I hope we do get it done soon though. Thanks!