Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

Supporting Generic Sequences #148

Closed mobinasri closed 4 years ago

mobinasri commented 4 years ago

The first PR to make edlib support generic sequences. The main changes:

  1. Replacing most of the functions and classes by their templates with two type arguments, Element and AlphabetIdx. (Element is the type of elements in the given sequences, AlpahbetIdx is the type of indexes for the alphabet )
  2. Commenting and disabling the parts related to additional equalities temporarily. (to be added later)
  3. Putting the code in the namespace of "edlib". The implementation details of edlib.tpp are put in the nested namespace of "internal" to make them hidden from the user program.
  4. Adapting the character-based tests to the above changes
Martinsos commented 4 years ago

@masri2019 , I see you responded to many comments but have not pushed any new commits -> you do plan to push the fixes for the comments on this PR right, in form of additional commits?

Martinsos commented 4 years ago

@masri2019 I see you added a lot of commits, great -> let me know when you added everything and I will take another look.

mobinasri commented 4 years ago

@Martinsos I think I added everything that you said and we discussed.

mobinasri commented 4 years ago

@Martinsos I did your comments. I also wanted to tell you about some of the suggestions that Paolo gave me while he was experimenting. Please tell me your opinion.

  1. Using classes instead of structs and class enums instead of enums; He said that using structs forces the user to do initialization and termination functions instead of relying on constructor and deconstructors. If the user forgets this, you can get undefined behavior (e.g. EdlibAlignConfig may have an uninitialized pointer)
  2. Error messages from the linker about duplicate definitions of some Edlib functions; He said that after including edlib.hpp in two different places he got such errors because of the functions which are not template. (e.g edlibNewAlignConfig(), edlibAlignmentToCigar()). They must have inline specifier to fix this error.
mobinasri commented 4 years ago

@Martinsos Nice Comments about making edlib more CPPish. I just wanted to let you know those feedbacks from Paolo. I think you are right, let's focus on generic features for now.

mobinasri commented 4 years ago

@Martinsos Based on what you said about testing "multiple definitions", I made a new directory named testMultiDefinition in which there exist two source files. Both of them include edlib. One of them (staticTestLib.cpp) is going to be a static library that has to be linked to the other file (useStaticTestLib.cpp). Since both of them include edlib, if there exist multiple definitions of any function in edlib it will be announced by the linker. So, if we get errors during compilation it has to be something wrong with it.

Martinsos commented 4 years ago

@masri2019 Great! So, I again left one or two comments, but I approved the PR! So once you do those (they are tiny), please squash all the commits into one big commit (with git rebase -i Martinsos:gen-seqs or smth like that), and then I will merge it.

Martinsos commented 4 years ago

@masri2019 great, please just squahs it all as I describe above and I will merge!

Martinsos commented 4 years ago

@masri2019 all is great except commit does not have a meaningful message! Put a meaningful message, and also make description shorter and more meaningful (or just remove it if you don't think it is needed). Suggestion for commit message: "Rewrote edlib library to accept generic sequences, changes are not fully propagated through codebase.". Description could be: "C interface is gone for now. Some dependent code is not compiling / working (python binding, some tests, aligner, ...). Documentation is not updated."

mobinasri commented 4 years ago

@Martinsos When shall we start the next PR? As you said earlier, we can work on tests, aligner or even finding an efficient implementation for dealing with additional equalities. Please tell me if you have any comments on the next PR.

Martinsos commented 4 years ago

@masri2019 sorry I was waiting for you to squash this one and did not notice you already did it! Yes of course let's do the next PR :)!

Previously I suggested PR #2 to be taking care of the rest of the cpp code (not python) compiling and working (aligner, tests, ...), and then in PR #3 taking care of Equalities and similar. I am still ok with that, should we do it in that order? If you want to flip it that is also all right, I was just thinking it might make sense in this order so that we have tests to help us when implementing additional equalities.

Maybe next step is: you presenting me the "draft" of the next PR, which would just be short explanation of which parts exactly would be taken care of, and then I say "sounds good" and maybe point to one or two things if needed, and we go from there.

Martinsos commented 4 years ago

Wohoo we merged this one :D!

mobinasri commented 4 years ago

@Martinsos Great! As you said we can start working on tests and the aligner app in the next PR.

  1. For modifying aligner we have two approaches; First, we can call edlib functions having char as the element type. In that case, there is no need to change the output specifically, the NICE format. The other approach is to increase the alphabet size having something like int as the element type. This allows us to get integer sequences but we have to modify the NICE format a little bit since each element is a number and numbers need more than one character to be displayed. We have to put a sufficient amount of space between the numbers to make it appear nice! Or maybe we can add an option if the input is in char or integer? and use the corresponding NICE format in the output.

  2. For tests, I think we can replicate the previous test which is done on random char sequences this time for sequences with large alphabets. To do this, we can generate random sequences with large alphabets (We have to specify how large it can be. Maybe we can do it a range of alphabet sizes) and call edlib functions having int as the element type. Then the execution time will be compared to the simple dynamic programming code as it's written already (it also checks its correctness). I think it is better to add these tests to the existing test file. In that case, we have to distinguish execution times for char and int sequences in the output.

Please comment on these ideas!

Martinsos commented 4 years ago

@masri2019

  1. Mmmm, ok you identified that right, the main problem is output -> we don't know how to define it. I think we can argue that if alphabet is larger than 256, output is most likely going to be mess anyway hm, so probably it makes most sense to just not bother with it. I would suggest that we just make aligner work as it was working so far, which means with "char", and that is it -> people are anyway coming for libraries, not for aligner really, aligner serves more as a demo. What do you think? If there will be feature requests for aligner to add larger alphabet, we can always do it later.

  2. Sounds good! Could we make function runRandomTests generic, and have it take Element and AlphabedIdx, same as edlib does? Then it would propagate those to whatever calls to edlib it does. And random sequences it generates would include all the values that can fit inside AlphabetIdx (or we could have third template parameter, AlphabetSize, to specify max value for sequence elements). Then we can just call it with different template arguments and that is it. Ok, and we will probably want to extract those sequential calls to runRandomTest from main into a function that is also templated, and then we will be calling that function in the main multiple times with different template arguments. Yes, I agree we just do it in this file.

I think we will also want to add non-random test that checks that exception is thrown if alphabet is too big. Hmmm, any other special cases? I don't think so. What might also be good is having those non-random tests also run edlib with different template parameters, to ensure all that still holds in all conditions. But ok, that might be somewhat too much work for now so we could maybe leave that for later (maybe put a TODO).

mobinasri commented 4 years ago

@Martinsos Sounds good! All agreed. I'll start and send you the next PR for review in the next few days.

Martinsos commented 4 years ago

Awesome :)!

On Thu, Apr 16, 2020 at 11:07 AM Mobin Asri notifications@github.com wrote:

@Martinsos https://github.com/Martinsos Sounds good! All agreed. I'll start and send you the next PR for review in the next few days.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Martinsos/edlib/pull/148#issuecomment-614518388, or unsubscribe https://github.com/notifications/unsubscribe-auth/AALXFBYA4LTGOMX47ZJ46SDRM3DEZANCNFSM4LRXJVRA .