McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

How to encode instructions? #20

Closed bhattg closed 3 months ago

bhattg commented 4 months ago

Hi!

Suppose I want to compute the semantic similarity between a bunch of instructions, how do I go about that? The following is what I have in mind --

"Encode the following for semantic search : [instruction]"

where [instruction] will be replaced with "### Instruction: ...".

Is this the right way? Does LLM2Vec recognize the special characters such as "###"?

bhattg commented 4 months ago

Any updates on this? I'd appreciate the help of the authors!

vaibhavad commented 3 months ago

Hi @bhattg,

Please refer to the Inference guide in the README. Specifically, the input can be in the form of `[[instruction1, text1], [instruction2, text2]]` or `[text1, text2]`.

You can also refer to the example for the Semantic Textual Similarity task provided in the repo.
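
For concreteness, here is a minimal sketch of both input forms, loosely following the README's usage example (the checkpoint names match the README, but the query and document texts are made up):

```python
import torch
from llm2vec import LLM2Vec

# Load an LLM2Vec model (base checkpoint + supervised PEFT weights).
l2v = LLM2Vec.from_pretrained(
    "McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp",
    peft_model_name_or_path="McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

# With instructions: each item is an [instruction, text] pair.
queries = [
    ["Retrieve semantically similar text", "How do I bake bread?"],
    ["Retrieve semantically similar text", "What is the capital of France?"],
]
q_reps = l2v.encode(queries)  # tensor of shape (2, hidden_dim)

# Without instructions: a plain list of strings.
docs = ["Bread is baked at around 220 C.", "Paris is the capital of France."]
d_reps = l2v.encode(docs)
```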

bhattg commented 3 months ago

To make sure I followed properly: in my case, I want to find the semantic similarity between two instructions. So I'd be feeding something like `["Retrieve semantically similar text", "### Instruction: {some prompt} \n ### Response: {some response}"]`?

One more question: how sensitive are LLM2Vec models to the instruction? Say, if I had used "Encode the following for semantic search" instead of "Retrieve semantically similar text", would it drastically change the results?

Thanks!

vaibhavad commented 3 months ago

I think the best approach would be to encode ["Retrieve semantically similar text", "<<< Instruction1 >>>"] and ["Retrieve semantically similar text", "<<< Instruction2 >>>"] and then measure the similarity between their embeddings.
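
As a sketch, with `l2v` loaded as in the snippet above (the two instruction strings here are placeholders, and the similarity step is plain PyTorch, not a llm2vec-specific API):

```python
import torch.nn.functional as F

task_instruction = "Retrieve semantically similar text"
instruction1 = "### Instruction: {some prompt} \n ### Response: {some response}"
instruction2 = "### Instruction: {another prompt} \n ### Response: {another response}"

# Encode each instruction as a [task_instruction, text] pair.
reps = l2v.encode([
    [task_instruction, instruction1],
    [task_instruction, instruction2],
])

# Cosine similarity between the two embeddings.
sim = F.cosine_similarity(reps[0], reps[1], dim=0).item()
print(f"similarity: {sim:.4f}")
```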

Currently we don't have a well-defined study on robustness to instructions; however, qualitatively the model seems to work well across different variations of the instruction.