caesar0301 / treelib

An efficient implementation of tree data structure in python 2/3.
http://treelib.readthedocs.io/en/latest/
Other
812 stars 186 forks source link

cython implementation #99

Open dtrckd opened 6 years ago

dtrckd commented 6 years ago

Hi, Is there any plan around for a cython implementation of treelib to make it real fast ?

AdeBC commented 4 years ago

Hi, I'm also looking for a package that has APIs similar to treelib and implemented with cython, cause I'm using the treelib to deal with tens of millions of data. Maybe we can make it ourselves or just make the treelib better? what do you think about it? I already have one but cythonize that is a little bit complex, are you familier with it?

ShravanTata commented 4 years ago

Hi @AdeBC, I just discovered treelib. Seems like a fun project. I would be interested to cythonize the backend. I am quite fluent with Cython. But the first step would be to setup benchmarking so that we clearly know what to optimize and rewrite only the critical areas

dtrckd commented 4 years ago

Hi guys, I have cythonized treelib for the need of an application I was working on some years ago now. It was really rewarding to do so. I am not so sure, but I think it was about 10 time faster (while also more memory efficient).

As all the operation of treelib are based on python dict and python loop soI found it hard to not cythonize everything. Basically, I used this structure: from libcpp.unordered_map cimport unordered_map as _map to store the nodes...

AdeBC commented 4 years ago

Hi @ShravanTata , same as what you have mentioned, I also think that cythonize the time-consuming APIs (not all of them) is a very good choice. But as far as I know, it is very hard to gain some speed-up through cythonizing a part of treelib, because treelib is based on a Python Dict (just as @dtrckd said) and just provide APIs for single node operations. When dealing with small amount of data, the treelib just works very well. But it's quite slow when dealing with millions of data because you cannot avoid using explicit for loop of Python, which is very very slow. So the problem doesn't come from treelib itself, it comes from slow for loop of Python.

So now what I'm considering is to cythonize the bases of treelib, so that we can provide vectorized APIs for large amount of data. Just like what R & Numpy are doing. That means we need to cythonize almost the entire treelib and write for loops in C++ I guess, I guess that's what vectorization is doing. That's a lot of work!

ShravanTata commented 4 years ago

@AdeBC : I completely agree with the points you make. It would indeed end up having to rewrite some big core chunks of the library in C++ or cython if not all of it. Have you already started work on this? If we have a good plan to break up the work then we can better estimate the work needed to do this. Let me know your thoughts