McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

How to get sentence embedding from last hidden state? #89

Closed · InfAGI closed this 3 months ago

InfAGI commented 4 months ago

# Tokenize, then mean-pool the last hidden state over the sequence dimension
inputs = l2v.tokenizer(doc, return_tensors="pt", padding=True, add_special_tokens=False)
res = l2v.model(**inputs, output_hidden_states=True).last_hidden_state.mean(dim=1)

Is this implementation equivalent to the encode function? Thank you!

vaibhavad commented 4 months ago

Hi @InfAGI,

Thanks for your interest in our work.

The llm2vec library takes care of computing the sentence embedding from the last hidden state. Concretely, this is implemented in the get_pooling function.
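For intuition, mean pooling typically masks out padding tokens before averaging over the sequence dimension. A minimal sketch, illustrative only and not the exact get_pooling code (note that a plain .mean(dim=1), as in the snippet above, would also average over padding positions):

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over the sequence dimension
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)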

The pooling method can be chosen by passing the pooling_mode argument to the LLM2Vec model, for example:

import torch
from llm2vec import LLM2Vec

l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
    pooling_mode="mean",
)
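With pooling_mode set, encode handles tokenization and pooling internally and returns one pooled vector per input. A quick illustration (the sentence is just a placeholder):

sentences = ["LLM2Vec turns decoder-only LLMs into text encoders."]
embeddings = l2v.encode(sentences)  # pooled according to the chosen pooling_mode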

Let me know if you have any further questions.

vaibhavad commented 3 months ago

Closing as it is stale. Feel free to re-open if you have any further questions.

andrewdotwang commented 1 week ago

I have a similar question - is it possible to obtain the embeddings for each token, i.e. the full sequence of token-level embeddings, without applying any pooling method?
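Something like the following sketch, assuming the underlying Hugging Face model is exposed as l2v.model as in the snippet above:

inputs = l2v.tokenizer(["some document"], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = l2v.model(**inputs)
token_embeddings = outputs.last_hidden_state  # (batch, seq_len, hidden), no pooling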