draft: Cythonize - Githubissues

nucccc commented 1 year ago

Good evening, I provide in here a folder with an example of cython code and a benchmark, so that if you want to think about it you can find some material from which to start.

Inside the simylaritymeasures folder, there will be a cy_simylaritymeasures, in which i left only the cython code, scripts and quick documentation in readme.md to build it and run it.

Hoping it can turn out to be of any utility, Nuc

cjekel commented 1 year ago

Very nice! It's going to be until the weekend until I can comment in detail.

I see this is a very minimal cython port. That's great!

I'm curious what happens if you drop

c = distance.cdist(exp_data, num_data, metric='minkowski', p=p)

and use cython to create this matrix? It might be slightly faster to calculate the distance when we need it, rather than creating the large matrix?

Also, we may want to show performance improvements with respect to both the number of data, and the number of dimensions in the data.

nucccc commented 1 year ago

After your suggestion, I even modified the cython code for a hypothetized improvement by eliminating not the distances matrix c = distance.cdist(exp_data, num_data, metric='minkowski', p=p) but rather the ca matrix of the coefficients. That didn't look to me as a great speed improvement, but may remove from ram an additional matrix. All calculations are done inplace on the original distance matrix c, thus modifying it. That one can be seem as a procedural change, and I still to reflect on that, but may lead to a decrease in RAM occupation. I hope it to be possibly useful.

I forgot to mention that as a case I'm actually working just on the frechet distance, but of course the cythonization can be then extended on all the other functions.

Thank you for the attention and the work, Nuc

cjekel commented 1 year ago

Just a bump. I'm still very interested in this comparison, I've just been swamped lately!

cjekel commented 1 year ago

Nice bump adding dtw! I need to test this out asap.

nucccc commented 1 year ago

Ah yes, i just added a dtw comparison. Then i think the benchmark could be better organized, I would put some more work in it.

I'm swamped too and thank you for the attention!

Also if you ever need a quick voice discussion just let me know and we can keep in touch via google meet or something at a point, I'm based in Italy and right now I'm flexible with my schedule.

cjekel commented 11 months ago

I seem to be getting about a 40% improvement gain, that is not sensitive to the total amount of data. This is good! I think I should consider porting the entire library to cython. It would be nice to have both a cython and python version side by side, since the python version is slightly easier to read. Not sure what is the best way to have both...

1.0000510139013212
1.0000510139013212
True
[0.981130201, 1.0351228799999999, 1.0592268329999999, 1.0405540450000004, 1.066770408]
[0.594745455, 0.5694208449999998, 0.5682200880000003, 0.5850636710000003, 0.5942479000000009]
average execution time for non cythonized version: 1.036561
average execution time for cythonized version: 0.582340
improvement: 43 %
100.00171783203773
100.00171783203773
True
[0.9917487250000008, 1.0106956060000005, 0.984660250000001, 1.0517639240000012, 0.9957187320000003]
[0.6305159309999997, 0.6164782420000012, 0.6147694729999991, 0.6011433679999989, 0.6316231439999989]
average execution time for non cythonized version: 1.006917
average execution time for cythonized version: 0.618906
improvement: 38 %

nucccc commented 11 months ago

Hi, in this period I could be a little bit busy, but I could find some time. Recently I red that Cython 3.0 was released, which uses heavily type hinting, and one could resort to type hints such as cython.int or cython.long to actually have types, I would give a look if one could could write pythonic code and then let Cython 3.0 transpile it with that performance gain.

cjekel / similarity_measures

draft: Cythonize #33