Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
506 stars 165 forks source link

enhance python API (want PR?) #100

Open kurtbrose opened 6 years ago

kurtbrose commented 6 years ago

There seem to be some low hanging fruits in enhancing the cython/python API.

1- calling encode() to convert unicode to bytes; encode('utf-8') is safer 2- accept bytes and pass them through 3- switch from malloc to PyMem_Malloc, so that the python VM can do a better job of utilizing allocation pools (http://cython.readthedocs.io/en/latest/src/tutorial/memory_allocation.html -- cython docs showing how and explaining this is preferred) -- maybe there is some reason this can't be done because the C++ code is using raw free on allocated memory?

4)

Regarding this:

            # TODO(martin): Is there a better way to do this conversion? There must be.
            tmp_bytes = additionalEqualities[i][0].encode();
            tmp_cstring = tmp_bytes;
            cconfig.additionalEqualities[i].first = tmp_cstring[0]
            tmp_bytes = additionalEqualities[i][1].encode();
            tmp_cstring = tmp_bytes;
            cconfig.additionalEqualities[i].second = tmp_cstring[0]

I think it can be simplified to this:

            cconfig.additionalEqualities[i].first = additionalEqualities[i][0].encode('utf-8')
            cconfig.additionalEqualities[i].second =additionalEqualities[i][1].encode('utf-8')

Cython can see the type of the field is char and will automatically convert bytes. You can do <char> as well to force the conversion. Again maybe you tried that and ran into some issue?

5- release the GIL

  with nogil:
      cresult = cedlib.edlibAlign(cquery, len(query), ctarget, len(target), cconfig)

Since this can be quite slow (10ms+) it is worth releasing the GIL so that this code can be multi-threaded from python.

Martinsos commented 6 years ago

Hey @kurtbrose, thanks for all the suggestions! This was my first time and only time I was using Cython or writing a Python extension so I am absolutely sure there is lot to improve, and would love to hear any advice.

Let's go through each point:

  1. sure, makes sense
  2. makes sense to do it if that is what people what to input. Maybe we could just check in the method to see if type is bytes -> if not, convert them, and document that both are accepted.
  3. I was freeing it a few lines lower with free, so sure, if PyMem_Malloc is better let's use that (I didn't know about it).
  4. Unfortunately I can't remember, I know I tried some simpler methods but could not get it working. But maybe I have not tried this -> if it works, awesome.
  5. Certainly! I did not know about GIL :). This allows this line to execute multi-threaded or it allows caller to call the whole python method multi-threaded? Because edlibAlign method is not multi-threaded and will probably not be for long time / ever.

Regarding PR -> yes that would be awesome! @mahmoud suggested very similar changes as you, so we could if we can somehow combine your knowledge/ideas :D, maybe the best way is you make a PR and then we also include him.

Martinsos commented 6 years ago

@kurtbrose how is this going, don't give up :D! It would be nice to have Python library improved, let me know if you need any help or got stuck on anything.