embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'prompt_name' #1224

violenil opened this issue 1 week ago

I get the following error when trying to run evaluation tasks:

TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'prompt_name'

Code snippet to reproduce:

import mteb
from transformers import AutoModel

# Model loaded directly with transformers (not via sentence-transformers)
model_name = "jinaai/jina-embeddings-v2-base-en"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")

Full traceback:

ERROR:mteb.evaluation.MTEB:Error while evaluating Banking77Classification: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'prompt_name'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../venv/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 422, in run
    raise e
  File ".../venv/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 383, in run
    results, tick, tock = self._run_eval(
  File ".../venv/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 260, in _run_eval
    results = task.evaluate(
  File ".../venv/lib/python3.10/site-packages/mteb/abstasks/AbsTaskClassification.py", line 102, in evaluate
    scores[hf_subset] = self._evaluate_subset(
  File ".../venv/lib/python3.10/site-packages/mteb/abstasks/AbsTaskClassification.py", line 178, in _evaluate_subset
    scores_exp, test_cache = evaluator(model, test_cache=test_cache)
  File ".../venv/lib/python3.10/site-packages/mteb/evaluation/evaluators/ClassificationEvaluator.py", line 296, in __call__
    X_train = model_encode(
  File ".../venv/lib/python3.10/site-packages/mteb/evaluation/evaluators/model_encode.py", line 40, in model_encode
    embeddings = model.encode(sentences, **kwargs)
  File ".../venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File ".../.cache/huggingface/modules/transformers_modules/jinaai/jina-bert-implementation/f3ec4cf7de7e561007f27c9efc7148b0bd713f81/modeling_bert.py", line 1220, in encode
    encoded_input = self.tokenizer(
  File ".../venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3073, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File ".../venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3160, in _call_one
    return self.batch_encode_plus(
  File ".../venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3356, in batch_encode_plus
    return self._batch_encode_plus(
TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'prompt_name'
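
Judging from the traceback, the root cause is that model_encode forwards prompt_name via **kwargs, the model's custom encode() passes unrecognized kwargs on to its tokenizer, and the fast tokenizer's _batch_encode_plus() rejects them. A minimal standalone illustration (bert-base-uncased is just an arbitrary fast tokenizer, not the model from the repro):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# An unrecognized kwarg falls through __call__ -> batch_encode_plus ->
# _batch_encode_plus and raises the same TypeError as above:
tokenizer(["hello world"], prompt_name="query")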

A quick and easy fix would be to adjust the logic in mteb/evaluation/evaluators/model_encode.py to the following:

# Proposed change to mteb/evaluation/evaluators/model_encode.py
# (imports shown for completeness; the module already defines these)
import logging
from typing import Sequence

import numpy as np
import torch

from mteb.encoder_interface import Encoder

logger = logging.getLogger(__name__)


def model_encode(
    sentences: Sequence[str], *, model: Encoder, prompt_name: str | None, **kwargs
) -> np.ndarray:
    """A wrapper around the model.encode method that handles the prompt_name argument and standardizes the output to a numpy array.

    Args:
        sentences: The sentences to encode
        model: The model to use for encoding
        prompt_name: The prompt name to use for encoding
        **kwargs: Additional arguments to pass to the model.encode method

    Returns:
        The embeddings as a numpy array.
    """
    # Only forward prompt_name if the model actually supports prompts and
    # knows this particular prompt; otherwise plain encode() implementations
    # (such as the jina AutoModel above) choke on the unexpected kwarg.
    if prompt_name and getattr(model, "prompts", None) and prompt_name in model.prompts:
        kwargs["prompt_name"] = prompt_name
    logger.info(f"Encoding {len(sentences)} sentences.")

    embeddings = model.encode(sentences, **kwargs)
    if isinstance(embeddings, torch.Tensor):
        # Move off the accelerator and cast to float32 before converting
        embeddings = embeddings.cpu().detach().float()

    return np.asarray(embeddings)

In other words, only pass the 'prompt_name' arg if the model has a 'prompts' attribute and 'prompt_name' is a key in that prompts dictionary; otherwise it is dropped. A quick sanity check is sketched below. I'm happy to make a PR if the above behaviour is expected.
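
Sanity check of the guard (DummyModel is a hypothetical stand-in, not part of mteb; like the AutoModel in the repro, it has no prompts attribute):

import numpy as np

class DummyModel:
    # No `prompts` attribute, so the guard should strip prompt_name
    def encode(self, sentences, **kwargs):
        assert "prompt_name" not in kwargs
        return np.zeros((len(sentences), 8))

embeddings = model_encode(["a", "b"], model=DummyModel(), prompt_name="query")
print(embeddings.shape)  # (2, 8)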

Muennighoff commented 3 days ago

Great catch - yes, I think we should change it as you describe, and a PR would be amazing, but maybe we should get thoughts from @KennethEnevoldsen first?

KennethEnevoldsen commented 24 minutes ago

Hmm, not sure about this one. Currently, we don't expect AutoModels as input, as they don't follow the encoder interface. We instead expect the user to load models using SentenceTransformers by default. Is there a reason why that can't be done here?
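
For reference, the SentenceTransformers path would look roughly like this (a sketch; it assumes sentence-transformers >= 2.3, where trust_remote_code is supported, and that the model ships a sentence-transformers config):

import mteb
from sentence_transformers import SentenceTransformer

# SentenceTransformer exposes the encode() interface mteb expects
model_name = "jinaai/jina-embeddings-v2-base-en"
model = SentenceTransformer(model_name, trust_remote_code=True)

tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")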