McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

Some special words cannot be encoded #80

Open Yan2266336 opened 1 month ago

Yan2266336 commented 1 month ago

I followed your code instructions to convert the llama 3-8b-instruct model into an embedder. When I use the l2v.encode() function to get the embedding of the sentence "Buddhism (religion/philosophy)", the code raises "RuntimeError: The expanded size of the tensor (10) must match the existing size (12) at non-singleton dimension 0. Target sizes: [10]. Tensor sizes: [12]". Only certain special words trigger this kind of error. Do you know the reason for this problem?

vaibhavad commented 1 month ago

Hi, can you share a code snippet to reproduce the issue?

Yan2266336 commented 1 month ago

OK. First, I converted the foundational llama 3-8b-instruct model into an embedder named "Llama-3-8B-instruct-Emb".

[screenshot: code converting the model into an embedder]

Then, I followed your instructions to reload this model and used it to get embeddings.

[screenshot: code reloading the saved model]

Finally, I used "l2v.encode()" to generate the embedding.

[screenshot: l2v.encode() call]

This is just one example of the issue. Most sentences are encoded correctly, but some raise this error, and I don't know how to solve it.

vaibhavad commented 1 month ago

Hi @Yan2266336,

I believe you are mixing two different ways of loading the model. Loading with transformers and trust_remote_code is required for the Hugging Face models, which ship custom files needed for bidirectional attention. l2v.save will not save those files, so if you load from a local directory that way, you are actually using a unidirectional model instead of a bidirectional one.
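
For reference, here is a minimal sketch of that transformers + trust_remote_code path (the model name and the pooling settings mirror the public LLM2Vec model cards, not this thread, so treat them as assumptions):

import torch
from transformers import AutoTokenizer, AutoConfig, AutoModel
from llm2vec import LLM2Vec

# trust_remote_code pulls in the custom bidirectional-attention model class
# shipped with the Hugging Face repository.
model_id = "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
# Wrap the backbone with LLM2Vec to get pooled sentence embeddings.
# (The model card additionally merges the MNTP LoRA weights with peft.)
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)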

If you are saving with l2v.save, then you should load the model with the LLM2Vec.from_pretrained method.
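
For illustration, a rough sketch of that save-then-reload flow (the exact l2v.save signature and the local directory name from the screenshots above are assumptions):

import torch
from llm2vec import LLM2Vec

# Build the embedder once from the base model, then save it locally.
l2v = LLM2Vec.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
l2v.save("Llama-3-8B-instruct-Emb")  # assumed to take an output directory

# Later, reload from the local directory through the same LLM2Vec entry point,
# not through plain transformers, so bidirectional attention stays enabled.
l2v = LLM2Vec.from_pretrained(
    "Llama-3-8B-instruct-Emb",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)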

The following snippet works without any errors on my end:

import torch
from llm2vec import LLM2Vec

if __name__ == "__main__":
    l2v = LLM2Vec.from_pretrained(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        device_map="cuda",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
    )

    l2v.encode(["Buddhism (religion/philosophy)"])

Yan2266336 commented 1 month ago

Thank you so much for helping me solve this issue. However, another issue arises when I use this code to get embeddings, as shown in the figure.

[screenshots: code and resulting error traceback]

Do you know what the problem is? I had tried this way to get embeddings, but it didn't work, which is why I used the code you shared on Hugging Face to get the embeddings.

vaibhavad commented 1 month ago

This is a known issue; please upgrade to the latest version of llm2vec, in which it is resolved:

pip install llm2vec==0.1.8
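
If it helps, a quick way to confirm which version is installed from Python (a small sketch, not from the thread):

from importlib.metadata import version

# Should print 0.1.8 or newer after the upgrade.
print(version("llm2vec"))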

Yan2266336 commented 1 month ago

It works. Thanks for helping me to solve these problems.

vaibhavad commented 1 month ago

No problem. Feel free to re-open the issue if you have any more questions

Yan2266336 commented 2 weeks ago

Hello, I have to bother you again. Recently I instruction-tuned a 'meta-llama/Meta-Llama-3-8B-Instruct' model and pushed it to my Hugging Face account as 'YBXL/Meta-Llama-3-8B-InstUMLS-Concept-train11e-06'. The model's structure is shown here:

[screenshot: model architecture]

However, when I loaded my model into the llm2vec framework in the same way, the previous issue arose again.

[screenshots: loading code and error traceback]

I also tested the foundational llama-3-8b-instruct model and my previously fine-tuned llama model, and they still work. Only the latest one, "YBXL/Meta-Llama-3-8B-InstUMLS-Concept-train11e-06", raises this problem in llm2vec. Could you please help me solve it? Thank you so much.

Yan2266336 commented 2 weeks ago

It seems like there is some issue with the responses; I haven't seen any reply from you.

vaibhavad commented 1 week ago

@Yan2266336, the issue arises because the input text is not wrapped in the proper template. This step happens here, but because the model name has changed, the if condition is not satisfied. Here is a quick workaround that overrides the function and applies the Llama-3 template:

import torch
from llm2vec import LLM2Vec

class CustomModel(LLM2Vec):
    def prepare_for_tokenization(self, text):
        # Always wrap the input in the Llama-3 chat template; the package's
        # model-name check no longer matches the renamed model, so it would
        # otherwise skip this step.
        text = (
            "<|start_header_id|>user<|end_header_id|>\n\n"
            + text.strip()
            + "<|eot_id|>"
        )
        return text

l2v = CustomModel.from_pretrained(
    "YBXL/Meta-Llama-3-8B-InstUMLS-Concept-train11e-06",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

l2v.encode(["Buddhism (religion/philosophy)"])

In the future, the prompt template will be specified outside the package (#56).

Yan2266336 commented 1 week ago

Thank you so much. So, going forward, I just need to define a custom model as you described here to apply the prompt template, right? Unless you move the prompt template outside the package later on.

vaibhavad commented 6 days ago

Yes, exactly. Your understanding is correct.