Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
505 stars 165 forks source link

Modifying tests and aligner-app #150

Closed mobinasri closed 4 years ago

mobinasri commented 4 years ago

This pull request contains two main changes;

  1. Making aligner app back to work. Currently, it only supports character sequences.
  2. Modify the previous tests to generate random sequences with any desired Element type and check their correctness and execution time. To achieve this aim, three template parameters are added to the previous random test functions, so now we can set the Element, AlphabetIdx, and also the alphabet size for random tests. A new test is added for checking the AlphabetTooBigException. It has to be thrown when the alphabet size of input sequences is larger than the maximum value that can be stored in AlphabetIdx.
Martinsos commented 4 years ago

@masri2019 looks good to me! I approved, although I asked again about that possibly redundant duplicated declaration, can't we remove that?

Also, let's do some squashing before we merge!

Martinsos commented 4 years ago

I merged the PR - ok great progress :)!

@masri2019, here is our initial PRs plan, just as reminder:

(PR #1):
Replacing edlib.h and edlib.cpp with edlib.hpp, which implements cpp generic version of edlib. (MORE ADVANCED: instead of just edlib.hpp, we could have edlib.hpp and edlib.tpp, where edlib.hpp corresponds to edlib.h and has #include edlib.tpp at the bottom, while edlib.tpp corresponds to edlib.cpp. While it is same for compiler, this provides some kind of interface vs implementation separation for developer, plus it will allow Github to show modifications in nicer way).
This PR does not implement C interface yet.
It modifies EqualityDefinition minimally to make it work (no disjoint set) -> either let it eat too much memory for now, or just remove the matrix and let it directly search through list of additional equalities in areEqual method.
The library should compile, but the rest does not have to. Tests and aligner and all the other parts of the cpp code -> if it is a lot of work to get them working, and I think it is, we leave them for later.
Python binding will be broken -> that is also fine.
(PR #2):
We update the rest of the cpp code: tests, aligner, ... to work with the changes made in PR #1.
(PR #3):
We do implementation optimizations / details, if needed: certainly EqualityDefinition, maybe smth else also.
We have tests working now, so we can use them.
CPP codebase is now fully modified and working as it should.
We also do performance tests here, by checking out different branches (old code and new code) and running appropriate tests.
If performance is worse then before, we do needed optimizations.
(PR #4):
We update documentation for CPP version (README.md).
(PR #5):
We modify python binding so it works with the new CPP implementation, and we also update its documentation.
(PR #6):
We consider if we should be adding C wrappers or not. If yes, we add cedlib.h and cedlib.cpp files, which will be small and short. Possibly we add a test or two, since it is merely using the CPP interface and most of the stuff is checked in compile time. We don't do performance testing, since it is not needed. We also add docs for it.
Again, we might just skip or postpone this step.
THE END:
We do final polishing, check that CI is passing, possibly run some final performance checks, and finally release new version (both cpp/c and python), which version bumped to 2.0.0 due to new interface.

So the idea for the next PR is to get EqualityDefinition working, also polishing anything that is left in the CPP codebase (if there is an ything, I think maybe even not), and doing performance tests to make sure we did not loose any performance.

mobinasri commented 4 years ago

@Martinsos Great!

For getting EqualityDefinition back to work, there are two approaches that I am thinking about. Before that, I want to tell you what I understood about how it can affect the run time. Well, EqualityDefinition has the method, areEqual. It is called in the function, buildPeq which pre-processes the query seq. It makes a table in which rows are query elements and the columns are alphabet elements. So by increasing the alphabet size making this table consumes more time and space (linearly increases). areEqual is called once for making each entry of this table. So the number of times that we call this method increases linearly by alphabet size. I can remember that while I was doing some performance tests I realized that about 10 percent of the execution time, the program is busy making this table. So we have to implement areEqual in a time efficient manner. The original edlib was constructing a matrix and accessing each entry is pretty fast but now because of the large alphabet size we cannot make such a matrix anymore. The two approaches:

  1. The first one is based on the transitive relation, which means that if A is equivalent to B (A ~ B) and B ~ C then A ~ C. If we have such an assumption we can use Disjoint-Set structure to map all the elements in the same equivalency set to just one element. That way, we can replace the elements in each set by their representative element in both query and target sequences. for example if we had A ~ B and B ~ C so all A,B and C would be in the same set and we can choose one of them (Actually the root of disjoint set) as the representative. If we choose A as the representative, then we replace B and C by A in our sequences. We can do the whole process in the function, transformSequences. and there is no need to change areEqual. We can even remove it from the code and wherever we are calling areEqual we can replace it by equality operator ==. It is going to be very fast. All these are based on transitive relation, which I am not sure if it is what the users are assuming.

  2. The second approach is to make a hash table in which each key is an element that has at least one equivalency relation. Each value can be a set of the equivalent elements. When areEqual is called, it is given two elements. It searches the hash table to see if the first element is in the keys. If it is not it just returns the output of == otherwise it looks at the value, which is a set here. If the second element is in the set it returns true otherwise false. For example we have; A ~ B, B ~ C, D ~ E We would have such a table: A : {B} B : {A, C} C : {B} D : {E} Maybe we can use std::unordered_map for the hash table and std::unordered_set for the set. We can also use other implementations. I looked at this link that compares some of these implementations. https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html We are not going to write or delete the table after we made it so, we have to concern about the reads and read misses. According on this link, tsl::hopscotch_map and tsl::robin_map can also be other fast options. Of course it is going to be slower than the original matrix-based implementation of areEqual but it is much more memory efficient. On the other hand, if go with the first approach I think we would save a lot of time and memory. But that needs to accept the transitive equivalency.

Anyway, Please tell me your opinion and if you have any other approaches in your mind.

mobinasri commented 4 years ago

@Martinsos Hi! I thought that maybe you missed my previous comment about how to get EqualityDefinition back to work. It would be great if you could send me your feedback.

Martinsos commented 4 years ago

@masri2019 Sorry for slow feedback, I had a couple of busy days - I put some gifs/emojis as apology :D. Great job really on this analysis, I enjoyed reading it :). There is nothing I can add to this, I think you covered everything!

greatjob

I think we can go with the assumption of transitivity, the other option is just to complex to justify without a strong use case / feature request. So, I am certainly for approach 1.

Unfortunately, we can't avoid building Peq while maintaining speed (at least I don't have any straight forward ideas right now, it is needed as a preprocessing step), but I believe for most use cases, its size should be manageable compared to the size of sequences in question. We could in future maybe isolate this step as a preprocessing step, so that user can calculate only once per query and use it on different targets that are in the same alphabet, but that is a whole different thing, not something to think about right now.

I believe complexity of building these disjoint sets should not be too big, so that is also fine.

Only thing I wonder about is, are we losing any information? Since we are actually reducing the size of alphabet and merging multiple letters into one letter, that means that down the "pipe" we have less knowledge than we had before -> for example, if we want to print some of them, we will not be able to map them back to original letter. On the other hand, I took a quick look and I don't see any place where this kind of information is needed, so I think it should be ok.

To summarize, great analysis :tada: , I agree with everything and love approach 1 and think we (you) should go with that! :rocket: :rocket:

mobinasri commented 4 years ago

@Martinsos Great! Thanks for your energetic response! So I'll start to implement a class template (I guess only one template parameter is needed and that should be just Element) for disjoint set in a separate header file (or I can do it in edlib.tpp?), include it in edlib.tpp and use it in transformSequences() before mapping to alphabet indexes. I also remove EqualityDefintion cause we don't need it anymore. As you said I think we don't lose any information since the original sequences are accessed no where after being mapped to the alphabet indexes. The output of edlibAlign() is the alignment and no modified sequence is returned so it seems that this approach won't disrupt anything. I'll send you the next PR.

Martinsos commented 4 years ago

@masri2019 sounds good! Should it be in seaprate file or not -> well since everything is already in edlib.tpp, I guess it will not make much of a difference :D. That should really be split one day. But if you feel you can nicely separate, I am also ok with that, whatever you prefer. Looking forward!