btwael / superstring.py

A fast and memory-optimized string library for heavy-text manipulation in Python
MIT License
250 stars 11 forks

Implement in Cython? #2

Open endrebak opened 4 years ago

endrebak commented 4 years ago

Your library seems like it could take advantage of Cython for greater speedups (but no increase in mem-efficiency).

See https://cython.org/

btwael commented 4 years ago

I don't have any experience with Cython, but I will check. I initially implemented this in C++, but with no automatic garbage collection in C++, the overhead of using smart pointers was too high.

endrebak commented 4 years ago

The nice thing about Cython is that you can use Python as you would normally, but add type annotations where they can lead to speedups, such as in for loops or when working with basic datatypes such as floats and integers.

btwael commented 4 years ago

Thank you for your fast reply. So the annotated library will have higher performance when running with Cython but still be runnable on Python (annotations ignored)?

endrebak commented 4 years ago

Cython creates an .so file that can easily be used as a regular library from Python. Cython is a very mature project used by some of the most popular and important libraries in the Python ecosystem, such as pandas. You run files created with Cython in Python.

It basically lets you combine C and/or C++ with Python to create a library or function that can be used as a regular Python library.
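On the question of whether the library would still run on plain Python: one common pattern (a minimal sketch, not something either library in this thread is confirmed to do) is to ship a pure-Python implementation and prefer a compiled extension only when it can be imported. The module name `_superstring_fast_hypothetical` and the function `count_char` below are made up for illustration.

```python
import importlib


def _count_char_py(text: str, ch: str) -> int:
    """Pure-Python fallback: count occurrences of ch in text."""
    count = 0
    for c in text:
        if c == ch:
            count += 1
    return count


try:
    # Hypothetical name for a Cython-compiled extension module (.so/.pyd).
    # If it has not been built, the import fails and we use the fallback.
    _fast = importlib.import_module("_superstring_fast_hypothetical")
    count_char = _fast.count_char
except ImportError:
    count_char = _count_char_py

print(count_char("banana", "a"))
```

Callers only ever use `count_char`, so the same script works whether or not the compiled module is present.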

Here you can get a feel for what Cython looks like (it is not an example of good code XD): https://github.com/biocore-ntnu/ncls/blob/master/ncls/src/ncls.pyx

You can see:

And the usage of the NCLS library is as if it were any other Python library:

# regular_script.py
from ncls import NCLS
...

Of course, I cannot know how much your library would be sped up by using Cython. But in the example above, my implementation is more than 1000 times faster than the pure Python version. It works best if you have contiguous data (i.e. data in arrays) represented with basic data types.

btwael commented 4 years ago

@endrebak if this gives such performance, we have to try it and see what we get. I will probably keep the current code as it is in the main branch and play with Cython in another one. Then we can do a comparison at the end and keep the most performant.
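For the comparison itself, something like the following could work: a rough sketch using the standard-library `timeit` module. The two functions are placeholders standing in for "current branch" and "Cython branch" implementations; they are not taken from superstring.py.

```python
import timeit


def concat_naive(parts):
    """Placeholder implementation A: repeated string concatenation."""
    out = ""
    for p in parts:
        out += p
    return out


def concat_join(parts):
    """Placeholder implementation B: a single str.join call."""
    return "".join(parts)


def compare(candidates, parts, repeat=3, number=20):
    """Time each candidate on the same input; return (winner, timings)."""
    timings = {}
    for name, fn in candidates.items():
        # Take the best of several runs to reduce noise from other processes.
        timings[name] = min(
            timeit.repeat(lambda: fn(parts), repeat=repeat, number=number)
        )
    winner = min(timings, key=timings.get)
    return winner, timings


parts = ["chunk"] * 2000
candidates = {"naive": concat_naive, "join": concat_join}
winner, timings = compare(candidates, parts)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s for 20 calls")
print("fastest:", winner)
```

Checking that both candidates return the same result before timing them is worth doing, since a faster but wrong implementation is not a win.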

endrebak commented 4 years ago

Only if you are interested and have time of course :)