Closed timonv closed 3 months ago
Hey. How do you plan on using this? What would the implementation look like?
We'd need a separate interface for sparse vectors. Right now, only dense vectors are supported.
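To illustrate what a separate sparse interface could look like, here is a minimal sketch using only the standard library. The `SparseEmbedding` name, its fields, and both methods are assumptions made for illustration, not the library's actual API. The idea is that a sparse vector stores only the nonzero positions and their weights, and similarity is computed over the positions two vectors share.

```rust
use std::cmp::Ordering;

/// Hypothetical sparse vector type: only nonzero entries are stored.
/// `indices` are assumed to be in ascending order, paired with `values`.
#[derive(Debug, Clone, PartialEq)]
pub struct SparseEmbedding {
    pub indices: Vec<usize>,
    pub values: Vec<f32>,
}

impl SparseEmbedding {
    /// Build a sparse vector from a dense one by keeping nonzero entries.
    pub fn from_dense(dense: &[f32]) -> Self {
        let mut indices = Vec::new();
        let mut values = Vec::new();
        for (i, &v) in dense.iter().enumerate() {
            if v != 0.0 {
                indices.push(i);
                values.push(v);
            }
        }
        SparseEmbedding { indices, values }
    }

    /// Dot product over shared indices, walking both sorted index lists.
    pub fn dot(&self, other: &Self) -> f32 {
        let (mut i, mut j, mut sum) = (0usize, 0usize, 0.0f32);
        while i < self.indices.len() && j < other.indices.len() {
            match self.indices[i].cmp(&other.indices[j]) {
                Ordering::Less => i += 1,
                Ordering::Greater => j += 1,
                Ordering::Equal => {
                    sum += self.values[i] * other.values[j];
                    i += 1;
                    j += 1;
                }
            }
        }
        sum
    }
}
```

A dense interface can return a plain `Vec<f32>` per input, whereas a type like this needs two parallel lists, which is one reason the two would not share a return type.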
Did I miss something? I thought this was all that was needed; the vectors looked alright locally. I want to generate sparse vectors to do hybrid search with Splade, as I feel BM25 or cutoffs leave a lot on the table.
Looking over at the python implementation, is this the missing part? https://github.com/qdrant/fastembed/blob/9c72d2f59f91f87753da07c12a6f1082e233ecb3/fastembed/sparse/splade_pp.py#L39
That looks straightforward if ndarray supports the math.
All the vectors are normalized in FastEmbed-rs currently.
Yes.
Cool, I'd be happy to pick it up
For reference, https://github.com/Anush008/bm42-rs.
@Anush008 Ah cool, so if I understand correctly, the whole trick is to first normalize the weights, then return all nonzero values together with the index of each weight in the token dictionary. And then I suppose it doesn't matter to a search engine what that dictionary is. That's pretty neat.
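The post-processing described above can be sketched roughly as follows. This follows my reading of the linked Python code (ReLU, then log(1 + x), mask out padding, take the maximum over token positions, keep the nonzero entries). The `SparseVector` type and the `splade_post_process` function are hypothetical names, and plain nested `Vec`s stand in for ndarray so the example has no dependencies.

```rust
/// Hypothetical output type: vocabulary indices and their weights.
#[derive(Debug, Clone, PartialEq)]
pub struct SparseVector {
    pub indices: Vec<usize>,
    pub values: Vec<f32>,
}

/// Sketch of SPLADE-style post-processing for a single input.
/// `logits[t][v]` is the raw model score for vocabulary entry `v` at token
/// position `t`; `attention_mask[t]` is 1 for real tokens and 0 for padding.
pub fn splade_post_process(logits: &[Vec<f32>], attention_mask: &[u8]) -> SparseVector {
    let vocab_size = logits.first().map_or(0, |row| row.len());
    // Running maximum per vocabulary entry; starts at 0 because every
    // transformed weight is non-negative.
    let mut pooled = vec![0.0f32; vocab_size];

    for (row, &mask) in logits.iter().zip(attention_mask) {
        if mask == 0 {
            continue; // padding positions contribute nothing
        }
        for (slot, &logit) in pooled.iter_mut().zip(row) {
            // ReLU followed by log(1 + x)
            let weight = logit.max(0.0).ln_1p();
            if weight > *slot {
                *slot = weight;
            }
        }
    }

    // Keep only the nonzero weights and remember their vocabulary index.
    let mut indices = Vec::new();
    let mut values = Vec::new();
    for (i, &w) in pooled.iter().enumerate() {
        if w > 0.0 {
            indices.push(i);
            values.push(w);
        }
    }
    SparseVector { indices, values }
}
```

The search engine only needs the `(index, value)` pairs to be consistent between documents and queries, which is why the meaning of each index does not matter on its side.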
I've implemented it rather roughly by copying most of the code over from text_embedding, and tried to stick to the naming in the Python version.
I think the math checks out. I haven't optimized or otherwise cleaned anything up. Perhaps rayon would fit nicely into the post process step as well.
Hey @timonv. Awesome work and super quick turnaround. Thanks a lot.
Please feel free to add yourself to the authors list in the Cargo.toml file.
All right, looks like that's it from my side. Thanks for the hard work on this library!
Thank you. I think this looks awesome enough to go out.
:tada: This issue has been resolved in version 3.14.0 :tada:
The release is available on:
v3.14.0
Your semantic-release bot :package::rocket:
Hey there, awesome work on this library. I'm enthusiastically using Fastembed.rs in Swiftide, but needed sparse vectors for hybrid search. This PR adds Splade to do just that.