Open ashvardanian opened 9 months ago
@ashvardanian - thanks for submitting this. Do you have any benchmarks that show StringZilla makes ceja faster?
@MrPowers I don't have benchmarks specific to Ceja, but I have several benchmarks against Jellyfish in the StringZilla repository. There is also a Jupyter notebook to help explore the differences at stringzilla/scripts/bench_similarity.ipynb
🤗
Is there some specific benchmark you have in mind?
PS: There is also a portability issue I haven't referenced. Seems like jellyfish builds only 65 wheels, while today PyPI expects 105 targets. StringZilla publishes all of them.
This was intended as a small patch upgrading from JellyFish to StringZilla to accelerate some of the slowest and most frequently used string similarity measures. Along the way I've patched a few minor things:
functions.py
.pkg_resources
for setuptools
for tests.

Compared to JellyFish, StringZilla is generally at least 20% faster even on shorter strings. It is also more accurate, as JellyFish doesn't correctly handle Unicode strings. Here is a comparison table for the distance output by different packages.
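For readers wondering why Unicode handling can change the reported distance at all, here is a minimal sketch in plain Python. It does not call JellyFish or StringZilla and makes no claim about how either library is implemented; the `levenshtein` helper and the sample words are hypothetical, chosen only to show that the same pair of strings yields different edit distances depending on whether the comparison runs over code points or over UTF-8 bytes.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over any two sequences.

    Illustrative helper only; not the implementation used by any library.
    """
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


# Hypothetical sample pair: one accented character differs.
first, second = "école", "ecole"

# Comparing Python str objects works on Unicode code points.
by_code_point = levenshtein(first, second)

# Comparing the UTF-8 encodings works on raw bytes; "é" takes two bytes.
by_byte = levenshtein(first.encode("utf-8"), second.encode("utf-8"))

print(by_code_point, by_byte)
```

A library that measures distance over bytes and one that measures it over code points can therefore disagree on non-ASCII input even when both are internally consistent, so it is worth checking which unit each function uses before comparing their outputs.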