MrPowers / ceja

PySpark phonetic and string matching algorithms
MIT License
35 stars 5 forks source link

Add SIMD-accelerated APIs #7

Open ashvardanian opened 9 months ago

ashvardanian commented 9 months ago

This was indented as a small path upgrading from JellyFish to StringZilla to accelerate some of the slowest and frequently used string similarity measures. Along the way I've patched a few minor things.

Compared to JellyFish, StringZilla is generally at least 20% faster even on shorter strings. It is also more accurate, as JellyFish doesn't correctly handle Unicode strings. Here is a comparison table for the distance output by different packages.

Example Jellyfish Levenshtein RapidFuzz EditDistance NLTK StringZilla (Unicode) StringZilla (Bytes)
0 apple vs aple 1 1 1 1 1 1 1
1 αβγδ vs αγδ 1 1 1 1 1 1 2
2 école vs école 1 2 2 2 2 2 3
3 Schön vs Schön 1 2 2 2 2 2 3
4 💖 vs 💗 1 1 1 1 1 1 1
5 𠜎 𠜱 𠝹 𠱓 vs 𠜎𠜱𠝹𠱓 3 3 3 3 3 3 3
6 München vs Muenchen 2 2 2 2 2 2 2
7 façade vs facade 1 1 1 1 1 1 2
8 こんにちは世界 vs こんばんは世界 2 2 2 2 2 2 3
9 👩‍👩‍👧‍👦 vs 👨‍👩‍👧‍👦 1 1 1 1 1 1 1
10 Data科学123 vs Data科學321 3 3 3 3 3 3 3
11 🙂🌍🚀 vs 🙂🌎✨ 2 2 2 2 2 2 5
MrPowers commented 9 months ago

@ashvardanian - thanks for submitting this. Do you have any benchmarks that show StringZilla makes ceja faster?

ashvardanian commented 9 months ago

@MrPowers I don't have benchmarks specific to Ceja, but have several benchmarks against Jellyfish in the StringZilla repository. There is also a Jupyter notebook to help explore the differences at stringzilla/scripts/bench_similarity.ipynb 🤗

Is there some specific benchmark you have in mind?


PS: There is also a portability issue I haven't referenced. Seems like jellyfish builds only 65 wheels, while today PyPi expects 105 targets. StringZilla publishes all of them.