Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

Enabling additional equalities for generic sequences #152

Closed mobinasri closed 4 years ago

mobinasri commented 4 years ago

This PR contains the implementation of a disjoint-set data structure and using it to activate additional equalities for generic sequences. The main change is based on the transitive relation, which means that if A is equivalent to B (A ~ B) and B ~ C then A ~ C. If we have such an assumption we can use a disjoint-set structure to map all the elements in the same equivalency set to just one element. Here is the list of main changes:

  1. Implementation of disjoint-set in a separate header file and including it in the edlib source file.
  2. The elements in each set are replaced by their representative element in both query and target sequences. For example if we had A ~ B and B ~ C so all A,B and C would be in the same set and one of them is being used (Actually the root of disjoint set) as the representative. If we choose A as the representative, then we replace B and C by A in our sequences.This process is done in the function, transformSequences().
  3. The class of EqualityDefintion is removed and wherever we were calling areEqual it is now replaced by equality operator ==. It is going to be faster than before.
Martinsos commented 4 years ago

Oh oh, one idea -> what if we let them define the equality function themselves? Then, they could define it as they want, and in certain use cases it could be much faster then the solution with unordered sets/maps.

For example, let's say they have big alphabet, and only one equality: A == B. They could provide us with function { if (x == 'A' && y == 'B' || x == 'B' || y == 'A') then true else false }. While this still means we have to run this function for every equality checks while building Peq, at least it does not necessarily require calculating hashes of each element and looking them up in the map/set, which I believe is more expensive, regardless of time being constant in average. This also allows them to specify some fancy equalities hm. It also allows them to mess things up if the relation the provide is not a proper equality hm.

I am not sure how we would let them provide such function. Would it be implementation of interface that we specified? They could overload operator == on Element, but I am afraid that might be problematic in some other places if we don't expect it (or maybe not?). Hm, also, how would be implement this in Python? They could provide us with lambda, but can we transfer that to C++? I don't think so.

mobinasri commented 4 years ago

@Martinsos My apologies for the late response. Last week was the final week of the Spring quarter and there were a lot of assignments to do.

I thought about the two approaches you said (I mean using a set of pairs or using a hash table of sets). I wrote a script to implement both and compare their performance. My test scenario was that I made an array of EqualityPair similar to what edlib has. Then I generated a lot of random numbers in pairs and check if they are equal using both approaches. The execution time when I used a set of pairs was at least 4 times faster than using a hash table of sets. By the way, I could implement sets of pairs in fewer lines of code. So, I agree with you on using sets of pairs.

About allowing the user to define the equality function, I don't have any idea on how to use lambda in python and define a function based on that in C++. If you know any technique please tell me so maybe we can go that way. I guess this step is critical because this function is going to be called many times in edlib and if the user cannot provide a well-defined function it will harm the performance. I think using std::unordered_set is fine.

mobinasri commented 4 years ago

And one more thing is that I had to use boost/functional/hash.hpp for hashing std::pair and inserting pairs into a std::unordered_set. That's because there is no standard way of computing a hash on a pair.

The declaration of the set is something like this: unordered_set< pair<int, int>, boost::hash< pair<int, int> > > equalitySet;

We have to replace int by Element when we use it for edlib. I got this idea from here: https://stackoverflow.com/questions/15160889/how-can-i-make-an-unordered-set-of-pairs-of-integers-in-c

Martinsos commented 4 years ago

@Martinsos My apologies for the late response. Last week was the final week of the Spring quarter and there were a lot of assignments to do.

I thought about the two approaches you said (I mean using a set of pairs or using a hash table of sets). I wrote a script to implement both and compare their performance. My test scenario was that I made an array of EqualityPair similar to what edlib has. Then I generated a lot of random numbers in pairs and check if they are equal using both approaches. The execution time when I used a set of pairs was at least 4 times faster than using a hash table of sets. By the way, I could implement sets of pairs in fewer lines of code. So, I agree with you on using sets of pairs.

About allowing the user to define the equality function, I don't have any idea on how to use lambda in python and define a function based on that in C++. If you know any technique please tell me so maybe we can go that way. I guess this step is critical because this function is going to be called many times in edlib and if the user cannot provide a well-defined function it will harm the performance. I think using std::unordered_set is fine.

Great job on doing the tests @masri2019! I am slightly surprised that set of pairs is faster hm, I would have thought that there would be less hash collisions if we use map of sets, meaning it is all faster. Maybe it is slower because of memory allocation for sets hm? Anyway, if tests show that set of pairs is faster, great, let's go with that!

Regarding the function in Python - I also don't know an easy way, so I agree, let's drop that for now, that can be an optimization for the future at some moment if we decide it is worth bothering with it.

Regarding hashing of pair -> yes, I thought that might be a problem hm! Using boost normally makes sense, but, in Edlib we have no external dependencies, and adding Boost for something small like this feels like an overkill. I found this link https://stackoverflow.com/a/9729747/1509394 where they describe how to easily implement the same hashing as boost has -> so how about we do it on our own? I checked Boost source and implementation in Boost is still the same as in this SO answer. Although, they also mention that this hashing is not very good, but also that there is no better method out there, so I guess that is ok then.

Btw, as a final thought -> we could have specialization of AdditionalEqualities for char, where we build the table, like we did before. Or also some other kinds of optimizations. But I would also leave that for the future, I just wanted to mention it.

TLDR; -> Great, let's do set of pairs, however let's not use Boost, instead let's copy paste their hashing logic (it should be function hash_combine and then defining the hash struct for pair -> this struct uses hash_combine and then we give it to the unordered_set as second template argument I believe).

mobinasri commented 4 years ago

@Martinsos Great! I used the hash function you sent in my test code. Let me share my code and the result. Here is the test code: https://github.com/masri2019/temporaryCodes/blob/master/edlib/compareSetAndMap/compareSetAndMap.cpp Here is the result which shows that using a set of pairs is nearly 4 times faster: (for 20% expected hit rate) #std::unordered_map -> Finished in 0.051417 seconds #std::unordered_map -> Hit Count / all: 19853 / 100000 #std::unordered_set -> Finished in 0.012612 seconds #std::unordered_set -> Hit Count / all: 19919 / 100000 I'm going to do the similar implementation for edlib. Please tell me if you have any fundamental comments especially on how I implemented the set of pairs. If you think it is OK you can close this PR and I'll come back with a new one.

Martinsos commented 4 years ago

I took a look and looks good! Ok, let's go for it! I think this is good first effort, and if needed we can try optimizing this further in the future.

Btw. one thing I was thinking about for some time is setting up some really nice performance tests, that we could use in the future to optimize parts of Edlib as needed, without having to manually check the speed -> it would be great to have a suite of perpformance tests that target very specific things and that measure how much is specific part of edlib taking, and then making optimization would just be trying couple of approaches and checking out what are the numbers. Ok, certainly not something we should be doing now, but just wanted to share the idea, it would be really cool to have that one day. I did write some perf tests in the past, but that was to compare with other tools, so not really the same thing. I will open an issue for it.

On Sat, Jun 27, 2020 at 10:39 AM Mobin Asri notifications@github.com wrote:

@Martinsos https://github.com/Martinsos Great! I used the hash function you sent in my test code. Let me share my code and the result. Here is the test code:

https://github.com/masri2019/temporaryCodes/blob/master/edlib/compareSetAndMap/compareSetAndMap.cpp Here is the result which shows that using a set of pairs is nearly 4 times faster: (for 20% expected hit rate)

std::unordered_map -> Finished in 0.051417 seconds

std::unordered_map -> Hit Count / all: 19853 / 100000

std::unordered_set -> Finished in 0.012612 seconds

std::unordered_set -> Hit Count / all: 19919 / 100000

I'm going to do the similar implementation for edlib. Please tell me if you have any fundamental comments especially on how I implemented the set of pairs. If you think it is OK you can close this PR and I'll come back with a new one.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Martinsos/edlib/pull/152#issuecomment-650524236, or unsubscribe https://github.com/notifications/unsubscribe-auth/AALXFB7IGJQVBHPRATYRQ5LRYWV47ANCNFSM4NJXI77Q .

Martinsos commented 4 years ago

Ok, created an issue here: https://github.com/Martinsos/edlib/issues/153 .

On Sat, Jun 27, 2020 at 1:16 PM Martin Šošić sosic.martin@gmail.com wrote:

I took a look and looks good! Ok, let's go for it! I think this is good first effort, and if needed we can try optimizing this further in the future.

Btw. one thing I was thinking about for some time is setting up some really nice performance tests, that we could use in the future to optimize parts of Edlib as needed, without having to manually check the speed -> it would be great to have a suite of perpformance tests that target very specific things and that measure how much is specific part of edlib taking, and then making optimization would just be trying couple of approaches and checking out what are the numbers. Ok, certainly not something we should be doing now, but just wanted to share the idea, it would be really cool to have that one day. I did write some perf tests in the past, but that was to compare with other tools, so not really the same thing. I will open an issue for it.

On Sat, Jun 27, 2020 at 10:39 AM Mobin Asri notifications@github.com wrote:

@Martinsos https://github.com/Martinsos Great! I used the hash function you sent in my test code. Let me share my code and the result. Here is the test code:

https://github.com/masri2019/temporaryCodes/blob/master/edlib/compareSetAndMap/compareSetAndMap.cpp Here is the result which shows that using a set of pairs is nearly 4 times faster: (for 20% expected hit rate)

std::unordered_map -> Finished in 0.051417 seconds

std::unordered_map -> Hit Count / all: 19853 / 100000

std::unordered_set -> Finished in 0.012612 seconds

std::unordered_set -> Hit Count / all: 19919 / 100000

I'm going to do the similar implementation for edlib. Please tell me if you have any fundamental comments especially on how I implemented the set of pairs. If you think it is OK you can close this PR and I'll come back with a new one.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Martinsos/edlib/pull/152#issuecomment-650524236, or unsubscribe https://github.com/notifications/unsubscribe-auth/AALXFB7IGJQVBHPRATYRQ5LRYWV47ANCNFSM4NJXI77Q .