Anush008 / fastembed-rs

Library for generating vector embeddings, reranking in Rust
https://docs.rs/fastembed
Apache License 2.0

feat: Add v1 Qdrant/Splade_PP_en_v1 #94

Closed timonv closed 3 months ago

timonv commented 3 months ago

Hey there, awesome work on this library. I'm enthusiastically using Fastembed.rs in Swiftide, but needed sparse vectors for hybrid search. This PR adds Splade to do just that.

Anush008 commented 3 months ago

Hey. How do you plan on using this? What would the implementation look like?

Anush008 commented 3 months ago

We'd need a separate interface for sparse vectors. Right now, only dense are supported.

timonv commented 3 months ago

Did I miss something? I thought this was all that was needed; the vectors looked alright locally. I want to generate sparse vectors to do hybrid search with Splade, as I feel BM25 or cutoffs leave a lot on the table.

Looking over at the Python implementation, is this the missing part? https://github.com/qdrant/fastembed/blob/9c72d2f59f91f87753da07c12a6f1082e233ecb3/fastembed/sparse/splade_pp.py#L39

That looks straightforward if ndarray supports the math.

Anush008 commented 3 months ago

All the vectors are normalized in FastEmbed-rs currently.

Anush008 commented 3 months ago

> Looking over at the python implementation, is this the missing part? qdrant/fastembed@9c72d2f/fastembed/sparse/splade_pp.py#L39
>
> That looks straightforward if ndarray supports the math.

Yes.

timonv commented 3 months ago

Cool, I'd be happy to pick it up

Anush008 commented 3 months ago

For reference, https://github.com/Anush008/bm42-rs.

timonv commented 3 months ago

@Anush008 Ah cool, so if I understand correctly, the whole trick is to first normalize the weights, then return all non-zero values along with each weight's index into the token dictionary. And then I suppose it doesn't matter to a search engine what that dictionary is. That's pretty neat.

I've implemented it rather roughly by copying over most of text_embedding, and tried to stick to the naming in the Python version.

I think the math checks out. I haven't optimized or otherwise cleaned anything up. Perhaps rayon would fit nicely into the post process step as well.
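
For readers following along, here is a minimal sketch of the post-processing step described above. It is not the code from this PR. It assumes the usual SPLADE recipe, which is what the linked Python file appears to do: apply log(1 + ReLU(x)) to the per-token vocabulary logits, zero out padding tokens via the attention mask, max-pool over tokens, and keep only the non-zero entries. The `SparseVector` struct and `splade_post_process` function are hypothetical names chosen for illustration, and plain `Vec`s stand in for ndarray so the snippet has no dependencies.

```rust
/// Hypothetical container for one sparse embedding (illustrative name only).
#[derive(Debug, PartialEq)]
struct SparseVector {
    /// Positions in the tokenizer vocabulary that received a non-zero weight.
    indices: Vec<usize>,
    /// The weight for each entry in `indices`, in the same order.
    values: Vec<f32>,
}

/// Collapse per-token vocabulary logits into a single sparse vector.
///
/// `logits` is laid out as [tokens][vocab]; `attention_mask` has one entry
/// per token, where 0 marks padding. Both layouts are assumptions.
fn splade_post_process(logits: &[Vec<f32>], attention_mask: &[u8]) -> SparseVector {
    let vocab_size = logits.first().map_or(0, |row| row.len());
    let mut scores = vec![0.0f32; vocab_size];

    for (row, &mask) in logits.iter().zip(attention_mask) {
        if mask == 0 {
            continue; // padding tokens contribute nothing
        }
        for (score, &logit) in scores.iter_mut().zip(row) {
            // log(1 + relu(x)) dampens large activations and is never negative.
            let weight = logit.max(0.0).ln_1p();
            // Max-pool over the token axis.
            if weight > *score {
                *score = weight;
            }
        }
    }

    // Keep only vocabulary entries that ended up with a non-zero weight.
    let mut indices = Vec::new();
    let mut values = Vec::new();
    for (index, &score) in scores.iter().enumerate() {
        if score > 0.0 {
            indices.push(index);
            values.push(score);
        }
    }
    SparseVector { indices, values }
}
```

Skipping masked rows is equivalent to multiplying them by zero before the max, because every weight is non-negative. The outer loop over vocabulary entries is also where something like rayon could slot in for large batches.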

Anush008 commented 3 months ago

Hey @timonv. Awesome work and super quick turnaround. Thanks a lot.

Please feel free to add yourself to the authors list in the Cargo.toml file.

timonv commented 3 months ago

All right, looks like that's it from my side. Thanks for the hard work on this library!

Anush008 commented 3 months ago

Thank you. I think this looks awesome enough to go out.

github-actions[bot] commented 3 months ago

:tada: This issue has been resolved in version 3.14.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket: