Anush008 / fastembed-rs

Library for generating vector embeddings and reranking in Rust
https://docs.rs/fastembed
Apache License 2.0

Dynamic Quantization causes batching to change the output values #107

Closed · denwong47 closed this issue 2 months ago

denwong47 commented 2 months ago

Symptoms

The six quantized models in this crate (essentially any with ModelInfo { model_file: "model_quantized.onnx" }) will produce different embeddings as the batch size changes.

To Reproduce

Run

cargo test tests::test_embeddings -- --nocapture

The test should pass.

Change the batch size in the unit test to Some(3), as in the sketch below.
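For illustration, the change amounts to passing an explicit batch size to embed (a hypothetical excerpt; the variable names and exact test code in the repository may differ):

```rust
// Hypothetical excerpt from tests::test_embeddings, where `model` is a
// TextEmbedding and `documents` holds the four test strings:
let batched = model.embed(documents.clone(), Some(3)).unwrap(); // batches of 3 + 1
let single = model.embed(documents.clone(), None).unwrap();     // one batch of 4
assert_eq!(batched, single); // fails for the six quantized models
```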

The assertions should then fail for the six models above:

Mismatched embeddings for model Alibaba-NLP/gte-base-en-v1.5 at index 0: -1.6916804 != -1.7032102
Mismatched embeddings for model Alibaba-NLP/gte-base-en-v1.5 at index 1: -1.7150338 != -1.7076654
Mismatched embeddings for model Alibaba-NLP/gte-base-en-v1.5 at index 2: -1.732108 != -1.729326
Mismatched embeddings for model Alibaba-NLP/gte-base-en-v1.5 at index 3: -1.530315 != -1.5317788
Mismatched embeddings for model Alibaba-NLP/gte-base-en-v1.5: ["Hello, World!", "This is an example passage.", "fastembed-rs is licensed under Apache-2.0", "Some other short text here blah blah blah"]
Mismatched embeddings for model Alibaba-NLP/gte-large-en-v1.5 at index 0: -1.6144238 != -1.6044945
Mismatched embeddings for model Alibaba-NLP/gte-large-en-v1.5 at index 1: -1.6605772 != -1.6469251
Mismatched embeddings for model Alibaba-NLP/gte-large-en-v1.5 at index 2: -1.6774151 != -1.6828246
Mismatched embeddings for model Alibaba-NLP/gte-large-en-v1.5 at index 3: -1.6387382 != -1.6265479
Mismatched embeddings for model Alibaba-NLP/gte-large-en-v1.5: ["Hello, World!", "This is an example passage.", "fastembed-rs is licensed under Apache-2.0", "Some other short text here blah blah blah"]
Mismatched embeddings for model mixedbread-ai/mxbai-embed-large-v1 at index 0: -0.21646263 != -0.1811538
Mismatched embeddings for model mixedbread-ai/mxbai-embed-large-v1 at index 1: -0.2818942 != -0.2884392
Mismatched embeddings for model mixedbread-ai/mxbai-embed-large-v1 at index 2: -0.15454057 != -0.1636593
Mismatched embeddings for model mixedbread-ai/mxbai-embed-large-v1 at index 3: -0.21518138 != -0.21548103
Mismatched embeddings for model mixedbread-ai/mxbai-embed-large-v1: ["Hello, World!", "This is an example passage.", "fastembed-rs is licensed under Apache-2.0", "Some other short text here blah blah blah"]
Mismatched embeddings for model nomic-ai/nomic-embed-text-v1.5 at index 0: 0.20303695 != 0.20999804
Mismatched embeddings for model nomic-ai/nomic-embed-text-v1.5 at index 1: 0.14249149 != 0.13103808
Mismatched embeddings for model nomic-ai/nomic-embed-text-v1.5 at index 2: 0.13759416 != 0.14427708
Mismatched embeddings for model nomic-ai/nomic-embed-text-v1.5 at index 3: 0.14991708 != 0.13452803
Mismatched embeddings for model nomic-ai/nomic-embed-text-v1.5: ["Hello, World!", "This is an example passage.", "fastembed-rs is licensed under Apache-2.0", "Some other short text here blah blah blah"]
Mismatched embeddings for model Xenova/all-MiniLM-L12-v2 at index 0: -0.078739226 != -0.07808663
Mismatched embeddings for model Xenova/all-MiniLM-L12-v2 at index 1: 0.3097751 != 0.27919534
Mismatched embeddings for model Xenova/all-MiniLM-L12-v2 at index 2: -0.051262308 != -0.0770612
Mismatched embeddings for model Xenova/all-MiniLM-L12-v2 at index 3: -0.5380101 != -0.75660324
Mismatched embeddings for model Xenova/all-MiniLM-L12-v2: ["Hello, World!", "This is an example passage.", "fastembed-rs is licensed under Apache-2.0", "Some other short text here blah blah blah"]
Mismatched embeddings for model Xenova/all-MiniLM-L6-v2 at index 0: 0.56476647 != 0.5677276
Mismatched embeddings for model Xenova/all-MiniLM-L6-v2 at index 1: 0.35537243 != 0.40180072
Mismatched embeddings for model Xenova/all-MiniLM-L6-v2 at index 2: -0.1562563 != -0.15454668
Mismatched embeddings for model Xenova/all-MiniLM-L6-v2 at index 3: -0.49482125 != -0.4672576
Mismatched embeddings for model Xenova/all-MiniLM-L6-v2: ["Hello, World!", "This is an example passage.", "fastembed-rs is licensed under Apache-2.0", "Some other short text here blah blah blah"]

Cause

This is a known behaviour of quantized models, caused by dynamic quantization.

Quoting the PyTorch documentation: "The key idea with dynamic quantization as described here is that we are going to determine the scale factor for activations dynamically based on the data range observed at runtime."

This means that the data range is observed within each batch, so the scale factor (and with it the output values) depends on which inputs happen to be batched together. Embeddings generated in different batches are therefore not comparable.
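To make the mechanism concrete, here is a minimal, self-contained sketch of min/max (asymmetric) quantization. This is a simplification of what the ONNX Runtime kernels actually do, but it shows why the batch contents matter: the scale is derived from the observed range, so the same value round-trips differently depending on its batch-mates.

```rust
/// Quantize to u8 with a scale derived from the slice's own min/max range,
/// mirroring how dynamic quantization picks the scale from runtime data.
fn quantize(values: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = (max - min) / 255.0;
    let quantized = values.iter().map(|v| ((v - min) / scale).round() as u8).collect();
    (quantized, scale, min)
}

fn dequantize(q: u8, scale: f32, min: f32) -> f32 {
    q as f32 * scale + min
}

fn main() {
    // The same activation value, 0.5, quantized alongside two different batches:
    let (q1, s1, m1) = quantize(&[0.5, 0.6, 0.4]);    // narrow observed range
    let (q2, s2, m2) = quantize(&[0.5, 10.0, -10.0]); // wide observed range
    // Round-tripping 0.5 gives different values because the scale (and hence
    // the rounding error) depends on the other values in the batch.
    println!("{} vs {}", dequantize(q1[0], s1, m1), dequantize(q2[0], s2, m2));
}
```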

Proposed Solution

I am currently researching whether there is any way of defining the data range independently of the input data. I will also need to look deeper into how other packages deal with this.

However, it is worth noting that even if we solve this issue, embeddings generated for one set of documents will fundamentally not be comparable to embeddings generated for another set. Embeddings from a quantized model are only meaningful relative to one another. So there is certainly an argument that batching naturally does not go well with quantized models, and we may simply disable it when dynamic quantization is in use; a sketch of that idea follows.
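As a rough sketch of the disable-batching idea (hypothetical code, not the crate's actual internals; the function name and the 256 default are assumptions), a quantized model would always embed the whole input as a single batch, so every document shares one observed data range:

```rust
/// Hypothetical guard: dynamically quantized models always get one batch,
/// so all documents share the same runtime-observed data range.
fn effective_batch_size(is_quantized: bool, requested: Option<usize>, n_docs: usize) -> usize {
    if is_quantized {
        n_docs.max(1)              // ignore the caller's batch size
    } else {
        requested.unwrap_or(256)   // assumed default batch size
    }
}
```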

Anush008 commented 2 months ago

So there is certainly an argument that batching naturally does not go well with quantized models, and we may simply disable it when dynamic quantization is in use.

I think we can do that, if it would be the simpler, more straightforward solution.

Anush008 commented 2 months ago

I also see that the individual values are not way off. Still, you're right: they're different.

denwong47 commented 2 months ago

I think we can do that, if it would be the simpler, more straightforward solution.

I'll do a quick PR after work.

denwong47 commented 2 months ago

I also see that the individual values are not way off. Still, you're right: they're different.

Since the values depend on the data range of the batch, one could craft two polar-opposite strings that give completely different results when embedded individually and when batched together.
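To illustrate, the effect can be reproduced by embedding the same documents with different batch sizes (a sketch against the InitOptions/embed API shown in the crate's README at the time; exact option names may differ between versions):

```rust
use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

fn main() -> anyhow::Result<()> {
    // Xenova/all-MiniLM-L6-v2 is one of the six dynamically quantized models.
    let model = TextEmbedding::try_new(InitOptions {
        model_name: EmbeddingModel::AllMiniLML6V2,
        ..Default::default()
    })?;

    // Two deliberately dissimilar strings: batched together, they stretch the
    // observed data range and shift the dynamically chosen scale factor.
    let docs = vec!["Hello, World!", "Some other short text here blah blah blah"];

    let together = model.embed(docs.clone(), None)?;    // one batch of two
    let separate = model.embed(docs.clone(), Some(1))?; // two batches of one

    // With dynamic quantization these generally differ.
    println!("{} vs {}", together[0][0], separate[0][0]);
    Ok(())
}
```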

github-actions[bot] commented 1 month ago

:tada: This issue has been resolved in version 4.0.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket: