Add SIMD-accelerated APIs

ashvardanian commented 9 months ago

This was indented as a small path upgrading from JellyFish to StringZilla to accelerate some of the slowest and frequently used string similarity measures. Along the way I've patched a few minor things.

Hamming and Levenshtein support SIMD and buffers.
Added docstrings for all APIs.
Fixed non-standard 5-char indent in functions.py.
Upgraded PyTest for compatibility with newer Pyhton.
Added pkg_resources for setuptools for tests.

Compared to JellyFish, StringZilla is generally at least 20% faster even on shorter strings. It is also more accurate, as JellyFish doesn't correctly handle Unicode strings. Here is a comparison table for the distance output by different packages.

	Example	Jellyfish	Levenshtein	RapidFuzz	EditDistance	NLTK	StringZilla (Unicode)	StringZilla (Bytes)
0	apple vs aple	1	1	1	1	1	1	1
1	αβγδ vs αγδ	1	1	1	1	1	1	2
2	école vs école	1	2	2	2	2	2	3
3	Schön vs Schön	1	2	2	2	2	2	3
4	💖 vs 💗	1	1	1	1	1	1	1
5	𠜎𠜱𠝹𠱓 vs 𠜎𠜱𠝹𠱓	3	3	3	3	3	3	3
6	München vs Muenchen	2	2	2	2	2	2	2
7	façade vs facade	1	1	1	1	1	1	2
8	こんにちは世界 vs こんばんは世界	2	2	2	2	2	2	3
9	👩‍👩‍👧‍👦 vs 👨‍👩‍👧‍👦	1	1	1	1	1	1	1
10	Data科学123 vs Data科學321	3	3	3	3	3	3	3
11	🙂🌍🚀 vs 🙂🌎✨	2	2	2	2	2	2	5

MrPowers commented 9 months ago

@ashvardanian - thanks for submitting this. Do you have any benchmarks that show StringZilla makes ceja faster?

ashvardanian commented 9 months ago

@MrPowers I don't have benchmarks specific to Ceja, but have several benchmarks against Jellyfish in the StringZilla repository. There is also a Jupyter notebook to help explore the differences at stringzilla/scripts/bench_similarity.ipynb 🤗

Is there some specific benchmark you have in mind?

PS: There is also a portability issue I haven't referenced. Seems like jellyfish builds only 65 wheels, while today PyPi expects 105 targets. StringZilla publishes all of them.

MrPowers / ceja

Add SIMD-accelerated APIs #7