BurntSushi / fst

Represent large sets and maps compactly with finite state transducers.
The Unlicense
1.76k stars 123 forks source link

fuzzy search by overlap #135

Closed vsraptor closed 2 years ago

vsraptor commented 2 years ago

let say i store list of list of numbers f.e

1,5,785,890 4,5,22,89 11,5,88,124

searching for 1,22,89,124

will find 4,5,22,89

CAN FST be made to search by overlap, hamming dist, jaccard any of them

BurntSushi commented 2 years ago

This crate provides an FST with an arbitrary byte sequence as a key, and an unsigned 64-bit integer as a value. The data you've given me doesn't appear to fit that format, so I don't know how to answer your question.

If you're asking about posting lists for building an IR index, then yes, they can be done with FSTs and additional structures.

vsraptor commented 2 years ago

the list of number represent the indexes of sparse binaries say 2000 bits with sparsity of 2%.. so ~40 elements.

then i want to store 1000 to 1mln lists

AND finally given sparse vector i.e. list of indecies , find the top-N closest by OVERLAP i.e. count(A & B)

BurntSushi commented 2 years ago

I still don't understand your question, sorry. You're welcome to try and explain more, but I don't have much time to offer you. So you'll really need to be much clearer.

I think you're best bet is to review the docs for this crate and see whether your data fits.

vsraptor commented 2 years ago

Sorry :) let me try again..

I represent a binary 100101001 using the indecies of the 1s, so it becomes [1,4,6,9]

Now I want to save millions of list of numbers like the above.

Finally I receive a new binary X. I want to find the closest match to X using OVERLAP as a measure.

FST can do that using Levenstein distance. I simply want to use Overlap instead.

overlap('hello', 'hell') = 4

thanks

BurntSushi commented 2 years ago

If you can express your function as an operation on finite state machines, then sure. That's how Levenshtein works.