V4 Wishlist - Githubissues

ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖

https://ashvardanian.com/posts/stringzilla/

Apache License 2.0

2.2k stars 76 forks source link

V4 Wishlist #110

Open ashvardanian opened 8 months ago

ashvardanian commented 8 months ago

Features

[ ] Better hashing algorithms
[ ] Automata-based fuzzy searching algorithms

Breaking naming and organizational changes

[ ] Rename edit_distance to levenshtein_distance to match Hamming

Any other requests?

happysalada commented 7 months ago

Reading the readme, it doesn't mention about processing files that are compressed. Of course, the file can be decompressed first in some other way, but it would be nice to have a way to process a compressed file without having to load it first in memory. Let me be more specific, here is how you could process line by line with python

import gzip

with gzip.open('input.gz','rt') as f:
    for line in f:

but what if I'm going to ignore several lines anyways. Having some form of efficient search through compressed files would be nice. Thank you for making this project open source!

ghost commented 7 months ago

hi there, I would love to know what is the current hashing algos? And on automata-based fuzzy searching, will it perform better than current string search algo on paper and design? Thanks!

ashvardanian commented 7 months ago

@happysalada, search through compressed data is an attractive feature proposition. I've been thinking about it a lot over the years, but it's not trivial for most compression types. Will keep in mind.

@0xqd, we currently implement Rabin-style hashing and fingerprinting documented here. The header file also provides some details:

https://github.com/ashvardanian/StringZilla/blob/57209cb389d28e63cf5238af01296db995342b61/include/stringzilla/stringzilla.h#L925-L951

I am looking into alternative algorithms as well, but want the primary hash and the rolling hash to use the same schema.