McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

Different embeddings obtained when running with different batch size #68

Closed wufeim closed 1 month ago

wufeim commented 1 month ago

Thanks for sharing this awesome work.

I'm trying a simple symmetric text retrieval demo, which involves computing text embeddings for retrieval. What I don't understand is why I get different embeddings when I run l2v on one caption versus multiple captions:

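# l2v is an LLM2Vec model assumed to be initialized earlier; the setup code is omitted from this snippet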
sentences = [
    "how much protein should a female eat",
    "summit define",
    "As a general guideline",
    "Definition of summit for English Language Learners"]
print(l2v.encode(sentences[0:1])[0:1, :10])
print(l2v.encode(sentences[0:2])[0:1, :10])
print(l2v.encode(sentences[0:3])[0:1, :10])
print(l2v.encode(sentences[0:4])[0:1, :10])

Each print statement outputs the first 10 of the 4096 values of the first caption's embedding. I expect all print statements to output the same values, but they don't. Am I misunderstanding something here?

Thanks for your help!

vaibhavad commented 1 month ago

Hi @wufeim,

Thank you for your interest in our work and for raising this issue. While exploring it, we actually uncovered a bug in our code (#74).

Firstly, the output for batch size 1 is very different from the output for batch size >1 because of a bug in the implementation of bidirectional_llama.py. I have pushed a fix (#75). If you plan to use a batch size of 1 with encode, you should build the llm2vec package from source with the latest changes. The bug does not affect results when the batch size is greater than 1.

Regarding the output being different for batch sizes 2, 3, and 4 - this is a known issue with the transformers library. Basically, it happens due to the accumulation of matrix multiplication errors, which is more pronounced in lower precision such as bf16. Here is a detailed explanation by one of the maintainers of the transformers library.
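
As a minimal, standalone illustration of the general phenomenon (not the LLM2Vec code path): the same row multiplied by the same weight matrix can come out slightly different depending on the batch it sits in, especially in bf16. The shapes and values below are arbitrary.

import torch

torch.manual_seed(0)
W = torch.randn(512, 512, dtype=torch.bfloat16)
x = torch.randn(1, 512, dtype=torch.bfloat16)
pad = torch.randn(3, 512, dtype=torch.bfloat16)  # stand-in for the other sequences in a larger batch

out_single = (x @ W)[0]                      # "batch size 1"
out_batched = (torch.cat([x, pad]) @ W)[0]   # same row, "batch size 4"

# The two results may differ by a small amount, because the backend can pick a
# different kernel / reduction order for different input shapes.
print((out_single - out_batched).abs().max())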

By changing the precision to fp32, the variability across batch sizes will be reduced considerably, but it will never be zero (see the tests run in the detailed explanation referenced above).
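
As a rough sketch of what this looks like in practice (the checkpoint names and from_pretrained arguments below follow the README-style loading and are assumptions; adjust them to your setup), loading the model in fp32 and comparing the batch-size-1 embedding against the same sentence encoded inside a larger batch should agree up to a small tolerance:

import torch
from llm2vec import LLM2Vec

# Assumed checkpoint names; substitute whichever LLM2Vec model you are using.
l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.float32,  # fp32 instead of bf16 to reduce batch-size variability
)

sentences = [
    "how much protein should a female eat",
    "summit define",
    "As a general guideline",
    "Definition of summit for English Language Learners",
]

emb_single = l2v.encode(sentences[0:1])[0]   # first sentence alone
emb_batched = l2v.encode(sentences)[0]       # first sentence inside a batch of 4

# Not bitwise identical, but they should agree up to a small numerical tolerance.
print(torch.allclose(emb_single, emb_batched, atol=1e-5))
print((emb_single - emb_batched).abs().max())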

Here are more related issues on the transformers library for reference:
https://github.com/huggingface/transformers/issues/26869
https://github.com/huggingface/transformers/issues/27626

And another on llama.cpp: https://github.com/ggerganov/llama.cpp/issues/3014

Hope this answers your question, let me know if you have any more queries.

vaibhavad commented 1 month ago

Closing as it is stale, feel free to re-open if the issue persists