Anush008 / fastembed-rs

Library for generating vector embeddings and reranking in Rust
https://docs.rs/fastembed
Apache License 2.0

feat: `EmbeddingOutput` to allow for custom output key and transformations #102

Closed denwong47 closed 2 months ago

denwong47 commented 2 months ago

This PR enables a slightly more structured way to handle the output types available in ONNX models. Currently, once the session has generated its SessionOutputs, TextEmbedding::embed makes two assumptions that do not always hold:

While output keys like "last_hidden_state" and "pooler_output" are commonly supported, the actual keys exported by a model are controlled by its ONNX export configuration; "last_hidden_state" is not guaranteed to exist.

Once the output has been fetched by key, TextEmbedding::embed assumes the output matrix is 3-dimensional; it will panic otherwise, because ArrayBase::slice is called with 3 axes.

This may not always hold true; for text embeddings, a 2-dimensional output (number of inputs × embedding dimension) could in fact be more common.

This PR introduces a check and handles 2D and 3D matrices differently.
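A minimal sketch of what such a dimension check can look like, using plain slices as a stand-in for the `ndarray` types the crate actually uses (the function name and the 3D reduction chosen here are illustrative, not the PR's API):

```rust
/// Illustrative stand-in: a 2D output of shape (n_inputs, dim) is already one
/// embedding per row, while a 3D output of shape (n_inputs, n_tokens, dim)
/// needs a per-input reduction (here, simply the first token's vector).
/// Any other rank returns None instead of panicking.
fn embeddings_from_output(shape: &[usize], data: &[f32]) -> Option<Vec<Vec<f32>>> {
    match shape {
        // 2D: (n_inputs, dim) - each row is already an embedding.
        [n, dim] => Some(data.chunks(*dim).take(*n).map(|row| row.to_vec()).collect()),
        // 3D: (n_inputs, n_tokens, dim) - take the first token's vector per
        // input (CLS-style), as one possible reduction.
        [n, tokens, dim] => {
            let stride = tokens * dim;
            Some(
                (0..*n)
                    .map(|i| data[i * stride..i * stride + dim].to_vec())
                    .collect(),
            )
        }
        // Anything else: report failure instead of panicking mid-slice.
        _ => None,
    }
}
```

The key point is the final arm: an unexpected rank becomes a recoverable error rather than a slice panic.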

Motivation

This came about when I tried to use sentence-transformers/all-mpnet-base-v2 with fastembed-rs. The model does not natively have an ONNX equivalent, so editing models_list() to add an entry is not an option.

For this reason, optimum-cli was used to convert the model, and outputs were successfully generated from a TextEmbedding<UserDefinedEmbeddingModel> - only for the pipeline to panic afterwards.

I am assuming that the use of such a model is supported (at least evidently, it works); happy to be advised otherwise.

Test Plan

An additional integration test has been added to the tests folder behind an empty feature flag called optimum-cli. The test externally calls optimum-cli to pull models into the default cache folder. Since not everyone will have the CLI tool, or even Python, installed, the flag keeps the test from running by default.

Once the models are pulled, they are loaded as UserDefinedEmbeddingModel instances and their output dimensions are tested.

If you have optimum-cli installed and available in your shell environment, you can run this test via

cargo test --features=optimum-cli

Known Issues

Personally I would prefer the output_key to be exposed; these models from sentence-transformers appear to contain two outputs, sentence_embedding and token_embeddings, and there are perhaps endless variations out there that we won't be able to capture. However, as I could not introduce a new parameter to the function without breaking changes, I will leave this as a proposal here for the maintainers to consider.
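One non-breaking shape such a proposal could take is an optional setting that defaults to the current lookup behaviour, so existing callers compile unchanged. This is purely a hypothetical sketch - the struct, field, and function names below are illustrative and not part of the PR:

```rust
/// Hypothetical options struct: `None` keeps today's behaviour,
/// `Some(key)` fetches exactly that output key.
#[derive(Default)]
struct OutputOptions {
    /// e.g. Some("sentence_embedding".to_string()) for the
    /// sentence-transformers ONNX exports mentioned above.
    output_key: Option<String>,
}

/// Pick the output key to fetch from the keys the model actually exports.
/// With no explicit key, prefer "last_hidden_state", else fall back to the
/// first available output rather than failing outright.
fn resolve_key<'a>(opts: &OutputOptions, available: &[&'a str]) -> Option<&'a str> {
    match &opts.output_key {
        Some(k) => available.iter().copied().find(|a| *a == k.as_str()),
        None => available
            .iter()
            .copied()
            .find(|&a| a == "last_hidden_state")
            .or_else(|| available.first().copied()),
    }
}
```

Because the field is an `Option` with a `Default` impl, the extra knob can ride along on an existing options/init struct without changing any current call site.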

Many thanks for reviewing; I much appreciate your time.

Anush008 commented 2 months ago

Also, please feel free to add yourself to the authors list in the Cargo.toml file.

denwong47 commented 2 months ago

Thanks @Anush008, I'll add Python and pip install optimum-cli to the CI workflow. I'll await #101 before changing the TextEmbedding call.

As for the breaking changes, I have the following proposal, which may not be breaking after all:

Let me know if you have any thoughts on the above.

Anush008 commented 2 months ago

Hey. That's a solid plan. Thank you.

Once we have the 3 PRs merged, I'll do some QA tests before releasing v4.

Also, cc @triandco since this plan might be helpful for #99.

denwong47 commented 2 months ago

I had done some work on the above, only to realise that #99 changed how the array transformation works. Now that attention_mask needs to be used during the transformation from ndarray::Array to Vec<Embeddings>, I need to rethink how to implement EmbeddingOutputType; I'll have a think and come back in a few days.
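To illustrate why the transformation can no longer be a static function of the raw output alone: mask-aware mean pooling (the kind of post-processing #99 introduces) must average only over real tokens, which requires the attention mask alongside the session output. A sketch with plain slices standing in for the `ndarray` types (the function name is illustrative):

```rust
/// Mean-pool one input's token embeddings, counting only positions where the
/// attention mask is 1 (real tokens), not padding. `token_embeddings` is
/// (n_tokens, dim) as a Vec of rows; `attention_mask` has one entry per token.
fn mean_pool(token_embeddings: &[Vec<f32>], attention_mask: &[u8]) -> Vec<f32> {
    let dim = token_embeddings.first().map_or(0, |t| t.len());
    let mut sum = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (tok, &mask) in token_embeddings.iter().zip(attention_mask) {
        if mask == 1 {
            // Accumulate only unmasked token vectors.
            for (s, v) in sum.iter_mut().zip(tok) {
                *s += v;
            }
            count += 1.0;
        }
    }
    // Divide by the number of real tokens (guard against an all-zero mask).
    sum.iter().map(|s| s / count.max(1.0)).collect()
}
```

Without the mask, padding tokens would be averaged in and the result would depend on how inputs were batched and padded.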

denwong47 commented 2 months ago

Also, in typical Python fashion, installing optimum completely filled up the disk space on the GitHub runner... I don't know if we should skip that test or give the runner more resources. 😞

Anush008 commented 2 months ago

I guess we can skip in that case.

Anush008 commented 2 months ago

This is masterful work @denwong47. Thanks a lot.

Anush008 commented 2 months ago

I am a little worried about the many lifetime params and generics; it could put off newer contributors due to the complexity. The library source has been pretty straightforward to grok so far.

denwong47 commented 2 months ago

I am a little worried about the many lifetime params and generics; it could put off newer contributors due to the complexity. The library source has been pretty straightforward to grok so far.

I can cut back a bit on that. I think EmbeddingOutputType won't be generic over EmbeddingOutput anymore, since we need to pass in attention_mask, which stops the transformation from being static. So that's one down.

We won't be able to do much about the <'r, 's>, which comes from ort::SessionOutputs; but I'll scan through again tomorrow to make sure those are well contained within EmbeddingOutput and appear nowhere else.

Anush008 commented 2 months ago

πŸ™πŸ™πŸ™

Thanks.

denwong47 commented 2 months ago

Looking at it, I can also hardcode f32 as the only type usable with .try_extract_tensor::<F>(). This would eliminate all the F: ort::PrimitiveTensorElementType + Clone bounds floating around. However, it would prevent compatibility with models that use a different float width, which I gather has not been a requirement so far?

Anush008 commented 2 months ago

Which I gather has not been a requirement so far?

Nope. f32 has sufficed.

denwong47 commented 2 months ago

I have done quite a lot of refactoring to hopefully simplify things a bit. All but one of the generics are gone, and I have done some work in preparation for #99, such as a SingleBatchOutput struct (please help me name this) that allows attention_mask to be stored alongside the session_outputs, so that we can perform per-batch post-processing as we would in that PR.
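A guess at the rough shape of that idea - keeping each batch's output together with the mask that produced it, so pooling can run per batch. Field and method names here are illustrative, with plain `Vec`s standing in for the actual `ort`/`ndarray` types:

```rust
/// One batch's worth of model output, paired with the attention mask that
/// generated it, so mask-aware post-processing can run per batch.
struct SingleBatchOutput {
    /// Token embeddings for this batch (stand-in for the raw session output).
    token_embeddings: Vec<Vec<f32>>,
    /// One mask entry per token; 1 = real token, 0 = padding.
    attention_mask: Vec<u8>,
}

impl SingleBatchOutput {
    /// Number of real (unmasked) tokens in this batch - the denominator any
    /// mean-style pooling over this batch would need.
    fn real_token_count(&self) -> usize {
        self.attention_mask.iter().filter(|&&m| m == 1).count()
    }
}
```

Bundling the two together means the pooling step never has to re-associate masks with outputs across batch boundaries.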

I probably need to halt work for now until #99 is completed; then I'll merge/replicate the work over.

I have also added some value checks to the tests, in addition to the previous dimension checks. This is done by summing the vector and checking the resulting value, which should be safe enough for unit tests.
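A minimal sketch of that kind of check, with a loose tolerance so small cross-platform floating-point differences don't trip it (the function name and numbers are illustrative, not the test's actual values):

```rust
/// Compare the sum of an embedding against a reference sum within a
/// tolerance - a cheap value-level sanity check for unit tests, without
/// pinning every element exactly.
fn sum_close(embedding: &[f32], expected_sum: f32, tol: f32) -> bool {
    let sum: f32 = embedding.iter().sum();
    (sum - expected_sum).abs() <= tol
}
```

This trades precision for robustness: it would miss two errors that cancel out, but it reliably catches wrong output keys, wrong dimensions, and gross value drift.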

Let me know any thoughts, much appreciated.

P.S. I'll squash later.

denwong47 commented 2 months ago

Not sure what's going on with the tests; I'll test on Ubuntu - probably there are minor differences across architectures and OSes.

Anush008 commented 2 months ago

How do you think we should go forward? Merge this PR and wait for #99 to adapt? Or should we wait for #99 to be merged and adapt this with the post-process steps it adds?

Anush008 commented 2 months ago

probably there are minor differences across architectures and OSes?

Maybe try an even lower precision when comparing. Three digits after the decimal point should be good.

denwong47 commented 2 months ago

I am a bit worried that merging this PR will be a big shock for @triandco and make it very difficult for him to follow. I am happy for him to work out the pooling first, and I'll adapt mine to suit.

Anush008 commented 2 months ago

I am leaning towards

we wait for https://github.com/Anush008/fastembed-rs/pull/99 to be merged and adapt this with the post-process steps it adds?

So that @triandco won't have to bother about the changes here and can continue with the current state of master in #99.

What do you think @denwong47?

denwong47 commented 2 months ago

What do you think @denwong47?

Exactly on the same page.

Anush008 commented 2 months ago

I am a bit worried that merging this PR will be a big shock for @triandco and make it very difficult for him to follow. I am happy for him to work out the pooling first, and I'll adapt mine to suit.

Haha. We were thinking along the same lines. Let's do that.

denwong47 commented 2 months ago

I have rebased and regenerated all the test results using HEAD of #99, everything is passing.

However, I feel something is amiss: if I change the batch size from None to Some(3) (which forces 2 batches), the Mean pooling tests start failing. The same behaviour is observed in 3.14.1 and the current HEAD, but I thought that was what Pooling was meant to fix? The bottom line is, batching should not have an effect on the embedding values...?

I'll sit down with my Data Scientist tomorrow or Wednesday (@harrysjohnston your time to shine) to go over what's happening - just to make sure we are on the right track here.

Until then, I'll revert this to draft state.

denwong47 commented 2 months ago

Rebased again to current HEAD, all tests passing on None batch size.

Anush008 commented 2 months ago

Bumping https://github.com/Anush008/fastembed-rs/pull/102#discussion_r1713359378.

denwong47 commented 2 months ago

@Anush008 the bottom line is, this PR does not change the output of the embeddings - the test results were generated using the HEAD of #99 and the tests still pass - so I may be dealing with a pre-existing issue here, which may be better handled in a separate PR.

In any case, I'll get my dose of education on Pooling from my colleague first, as I could potentially be talking nonsense.

You can advise me on how you want to proceed - merge this first or not.

denwong47 commented 2 months ago

P.S. I still have to split the output handling out into a separate file if you want to merge this.

Anush008 commented 2 months ago

I am willing to merge this to keep the scope limited.

denwong47 commented 2 months ago

Ok, in that case mod split coming right up.

denwong47 commented 2 months ago

All done!

denwong47 commented 2 months ago

Thanks @Anush008 . Great working with you.

I'll report back on the batch and Pooling::Mean interaction.

Anush008 commented 2 months ago

Thank you.

github-actions[bot] commented 1 month ago

:tada: This issue has been resolved in version 4.0.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket: