embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

"KeyError: 'document' not found and no similar keys were found. #1445

Open LeMoussel opened 1 week ago

LeMoussel commented 1 week ago

With HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1, I get the following error:

Loader not specified for model HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1, loading using sentence transformers.
Traceback (most recent call last):
  File "/home/dev/Python/AI/MTEB/mteb_fr.py", line 252, in <module>
    mteb_model = mteb.get_model(
                 ^^^^^^^^^^^^^^^
  File "/home/dev/Python/AI/MTEB/venv/lib/python3.12/site-packages/mteb/models/overview.py", line 126, in get_model
    model = meta.load_model(**kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/Python/AI/MTEB/venv/lib/python3.12/site-packages/mteb/model_meta.py", line 120, in load_model
    model: Encoder = loader(**kwargs)  # type: ignore
                     ^^^^^^^^^^^^^^^^
  File "/home/dev/Python/AI/MTEB/venv/lib/python3.12/site-packages/mteb/model_meta.py", line 37, in sentence_transformers_loader
    return SentenceTransformerWrapper(model=model_name, revision=revision, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/Python/AI/MTEB/venv/lib/python3.12/site-packages/mteb/models/sentence_transformer_wrapper.py", line 48, in __init__
    model_prompts = self.validate_task_to_prompt_name(self.model.prompts)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/Python/AI/MTEB/venv/lib/python3.12/site-packages/mteb/models/wrapper.py", line 81, in validate_task_to_prompt_name
    task = mteb.get_task(task_name=task_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/Python/AI/MTEB/venv/lib/python3.12/site-packages/mteb/overview.py", line 318, in get_task
    raise KeyError(suggestion)
KeyError: "KeyError: 'document' not found and no similar keys were found."

Samoed commented 1 week ago

The problem is that this model ships its own prompts (the traceback shows a prompt key named 'document'), while MTEB expects prompts to be keyed by task, so the lookup for 'document' fails. Since this is an instruction model, it is better to run it through InstructWrapper; see the e5-instruct models for an example.
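
To make the failure mode concrete, here is a minimal sketch of the kind of check that produces this error. It is a simplified illustration and not MTEB's actual implementation: `KNOWN_TASK_NAMES`, `validate_prompt_keys`, and the contents of `model_prompts` are all hypothetical names and values chosen for the example.

```python
# Simplified illustration only; this is NOT MTEB's actual code.
# Hypothetical subset of task names that the benchmark knows about.
KNOWN_TASK_NAMES = {"SciDocsRR", "STS22"}


def validate_prompt_keys(prompts: dict[str, str], known: set[str]) -> dict[str, str]:
    """Raise KeyError for the first prompt key that is not a known task name."""
    for key in prompts:
        if key not in known:
            raise KeyError(f"'{key}' not found and no similar keys were found.")
    return prompts


# Hypothetical prompts, keyed by role rather than by task name.
model_prompts = {"document": ""}

try:
    validate_prompt_keys(model_prompts, KNOWN_TASK_NAMES)
except KeyError as err:
    message = err.args[0]
    print(message)
```

A prompt dictionary keyed by task names that the benchmark recognises would pass this check; one keyed by 'document' does not.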

LeMoussel commented 1 week ago

OK. This is what I do:

    import mteb
    import torch

    # https://huggingface.co/jinaai/jina-embeddings-v3/discussions/75
    MODEL_NAME = "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1"
    MODEL_URL = "https://huggingface.co/HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1"

    OUTPUT_FOLDER = "results"

    # Load the model through MTEB, on the GPU when one is available.
    mteb_model = mteb.get_model(
        MODEL_NAME,
        device="cuda" if torch.cuda.is_available() else "cpu",
    )

    # TASK_LIST is defined elsewhere in the script (not shown here).
    tasks = mteb.get_tasks(
        tasks=TASK_LIST, languages=["fra"]
    )

    evaluation = mteb.MTEB(tasks=tasks)
    mteb_results = evaluation.run(
        mteb_model,
        eval_splits=["test"],
        output_folder=f"{OUTPUT_FOLDER}/{MODEL_NAME}",
    )

How can I use InstructWrapper in this case?

Samoed commented 1 week ago

You can run this model like this:

import mteb
from mteb.models.instruct_wrapper import instruct_wrapper

mteb_model = instruct_wrapper(
    model_name_or_path="HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1",
    instruction_template="Instruct: {instruction} \n Query: ",
    attn="cccc",
    pooling_method="mean",
    mode="embedding",
    normalized=True,
)

tasks = mteb.get_tasks(
    tasks=["SciDocsRR"]
)

evaluation = mteb.MTEB(tasks=tasks)
mteb_results = evaluation.run(
    mteb_model,
)
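
For readers unfamiliar with instruction templates, the sketch below shows what the template string from the snippet above expands to. Only the template itself comes from this thread; `build_model_input`, the instruction text, and the query are hypothetical, and the assumption that the wrapper simply prepends the formatted instruction to the input text has not been checked against the wrapper's source.

```python
# Template string copied from the snippet above.
INSTRUCTION_TEMPLATE = "Instruct: {instruction} \n Query: "


def build_model_input(instruction: str, text: str) -> str:
    """Hypothetical helper: prefix `text` with the formatted instruction."""
    return INSTRUCTION_TEMPLATE.format(instruction=instruction) + text


# Hypothetical instruction and query, for illustration only.
example = build_model_input(
    "Given a query, retrieve relevant documents", "what is MTEB?"
)
print(repr(example))
```

Note that the template contains a literal newline between the instruction and the "Query:" marker, so it should be copied exactly as the model authors specify it.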

It would be very nice if you could add this model to the models folder with its metadata filled in.

LeMoussel commented 1 week ago

I would be happy to add this model to the models folder with its metadata filled in, but I am new to MTEB. What steps should I follow?

Samoed commented 1 week ago

You should fill in the metadata the same way it is done for the e5_instruct models, then run some tasks to check that this implementation matches the authors'.
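
One way to do the "run some tasks and compare" step is a small tolerance check between your scores and the ones the model authors report. The sketch below is only an illustration: `find_mismatches` is a hypothetical helper, the task names are placeholders, and the numbers are made up for the example and are not real scores for any model.

```python
def find_mismatches(
    measured: dict[str, float],
    reported: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, tuple[float, float]]:
    """Return {task: (measured, reported)} for tasks that differ by more than `tolerance`.

    Tasks that appear in only one of the two dictionaries are ignored.
    """
    mismatches = {}
    for task, reported_score in reported.items():
        if task not in measured:
            continue
        if abs(measured[task] - reported_score) > tolerance:
            mismatches[task] = (measured[task], reported_score)
    return mismatches


# Placeholder task names and scores, NOT real results.
measured_scores = {"TaskA": 0.50, "TaskB": 0.70}
reported_scores = {"TaskA": 0.505, "TaskB": 0.75}

mismatches = find_mismatches(measured_scores, reported_scores)
print(mismatches)
```

An empty result suggests the implementation reproduces the reported numbers within the chosen tolerance; any entry is a task worth investigating (prompt format, pooling, normalization).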