ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
https://ashvardanian.com/posts/stringzilla/
Apache License 2.0

Add search/split iterators for Python #80

Closed: ashvardanian closed this issue 5 months ago

ashvardanian commented 7 months ago

In C++ we have special smart iterators for bulk search and split operations. They lazily report the matches, avoiding heap allocations for the array of match offsets.

For that, an arbitrary matcher (string, character, or character set, in normal or reverse order) is combined with search or split ranges. Similar functionality should be added in Python, where we currently materialize the matches into a "compressed" Strs object.
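
To illustrate the lazy behavior requested here, below is a minimal sketch in plain Python. It is not the StringZilla API: `lazy_split` is a hypothetical name, and the built-in `str.find` stands in for the accelerated substring matcher. Pieces are produced one at a time, so no array of match offsets is ever allocated.

```python
from typing import Iterator


def lazy_split(text: str, separator: str) -> Iterator[str]:
    """Yield the pieces of `text` between occurrences of `separator`.

    Hypothetical helper for illustration only; `str.find` stands in
    for an accelerated forward substring search.
    """
    if not separator:
        raise ValueError("empty separator")
    start = 0
    while True:
        # Find the next match at or after the current position.
        index = text.find(separator, start)
        if index == -1:
            # No more matches: the remaining tail is the last piece.
            yield text[start:]
            return
        yield text[start:index]
        # Continue searching right after the matched separator.
        start = index + len(separator)


if __name__ == "__main__":
    for piece in lazy_split("alpha,beta,,gamma", ","):
        print(repr(piece))
```

A reverse-order variant could be sketched the same way with `str.rfind`, and a character-set matcher would replace the single `find` call with a scan for any character of the set.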

ghazariann commented 6 months ago

I'm very interested in contributing to this project as my first step into open-source. I believe I could start by addressing this issue. To clarify, are we aiming to replace the Strs type with something like StrIterator that yields strings lazily? As a first step, should I focus on modifying the split function to eliminate the use of realloc and ensure it returns an iterator instead? Any guidance on this would be greatly appreciated.

ashvardanian commented 6 months ago

Hi @ghazariann! I don't think we should replace Strs. We should keep both. The split should provide an iterator which, if materialized, is converted to Strs. How does that sound?
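
A rough sketch of that shape, assuming plain Python rather than the actual bindings: `SplitIterator` and `materialize` are hypothetical names, and a built-in `list` stands in for Strs. The object yields pieces lazily, and whatever has not been consumed yet can be collected in one call.

```python
from typing import List


class SplitIterator:
    """Lazy split that can be consumed piece by piece or materialized.

    Hypothetical illustration: a `list` stands in for the Strs container.
    """

    def __init__(self, text: str, separator: str) -> None:
        if not separator:
            raise ValueError("empty separator")
        self._text = text
        self._separator = separator
        self._start = 0
        self._done = False

    def __iter__(self) -> "SplitIterator":
        return self

    def __next__(self) -> str:
        if self._done:
            raise StopIteration
        index = self._text.find(self._separator, self._start)
        if index == -1:
            # The tail after the last separator is the final piece.
            self._done = True
            return self._text[self._start:]
        piece = self._text[self._start:index]
        self._start = index + len(self._separator)
        return piece

    def materialize(self) -> List[str]:
        """Collect all remaining pieces into a concrete container."""
        return list(self)


if __name__ == "__main__":
    parts = SplitIterator("one two three", " ")
    print(next(parts))          # consumed lazily
    print(parts.materialize())  # the rest, collected at once
```

Keeping both forms means callers who only need the first few pieces pay nothing for the rest, while callers who want random access can still ask for the full container.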

ashvardanian commented 5 months ago

Added in 3b6cdddbcba50b326eb44fa799381b978d99bdc5