Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

allow to take in bytes objects #117

Closed drmorr0 closed 6 years ago

drmorr0 commented 6 years ago

Hi,

I found your library earlier this evening and was very happy because I was thinking I was going to have to write this myself :)

I created this PR based on @kurtbrose suggestions in #100. I also mainly care about being able to pass in raw byte-strings but went ahead and fixed the other issues they mentioned too. It looks like this all works fine, I added a new simple test, but I'll be doing some more robust testing in my application and can report any bugs I find.

Cheers!

drmorr0 commented 6 years ago

Hmm. I don't understand why the tests failed, it passes locally. I'll investigate some more.

drmorr0 commented 6 years ago

Oh, I see, I was using Python3 but the tests are using Python2. This version should work in both 2 and 3.

Martinsos commented 6 years ago

Great @drmorr0! Thanks for explaining the details. Regarding the GIL, I think I should mention that Edlib is not currently using any concurrency, everything is run on single thread, so there will be no improvements in this case, but I guess it is still nice to have nogil here if Edlib in future does any concurrent operations.

Please squash the commits as you think is appropriately (probably all of them into one?) and I will merge the PR!

drmorr0 commented 6 years ago

Sure, just did! With regards to the GIL, it's not so much whether your C code does any parallel operations; what the "nogil" enables is for a Python program to call "align" in multiple threads and have them run in parallel.

Martinsos commented 6 years ago

Ok I am merging! @drmorr0 but again that will not be utilized in this case, because our python code is not calling C align in multiple threads, correct? In which case is this going to bring any benefits? If somebody calls edlib's python method in multiple threads, I am guessing that it will still not all run in parallel since only part of our code is under nogil, or how would that work?

kurtbrose commented 6 years ago

:-) glad to see the improvements! sorry I kind of disappeared

@Martinsos by default, any C extension of Python will hold onto the GIL and the Python VM will not run any other threads; in return for leaving the GIL alone you get the following benefits:

1- freedom to access / create / destroy python objects without worrying about corrupting memory (some access of python objects is still safe even without holding the GIL, but you need to be much more careful)

2- your own code gets the concurrency protection of the GIL -- you don't have to worry about multiple threads calling from python into your code at the same time; other threads won't be able to acquire the GIL until the first thread finishes its call

On the other hand, releasing the GIL gives the following concurrency benefits:

1- other threads in the process may execute python / C code while your code is running

2- your code may be called from multiple threads concurrently

note -- if your code isn't thread-safe, you'll need to protect it with some alternate lock if releasing the GIL

the easiest way is probably to use a threading.Lock from python

EDLIB_LOCK = threading.Lock()

def edlib(...):
   # ...
   with EDLIB_LOCK:
      with nogil:
         # call to C code
drmorr0 commented 6 years ago

@Martinsos Thanks a lot! At some point when you have some time can we put a new version up on pip? No big rush here.

@kurtbrose Thanks for the clarifications there. The GIL is confusing.

Martinsos commented 6 years ago

@kurtbrose thanks for reaching out, and for explaining the GIL in more details! Edlib is written without any global state or static variables or anything like that, it is just pure functions, so it should be thread safe.

@drmorr0 Ah yes I completely forgot to push new release :)! I just pushed it, it is now on pip.

Thanks everybody!