Massive Text Embedding Benchmark


Installation | Usage | Leaderboard | Documentation | Citing

Installation

```bash
pip install mteb
```
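
To quickly confirm that the installation worked, you can import the package and list some of the available tasks (a minimal sanity check; it only assumes the package installed correctly and downloads no models):

```python
# Minimal post-install sanity check: import mteb and count the classification tasks.
import mteb

tasks = mteb.get_tasks(task_types=["Classification"])
print(f"{len(tasks)} classification tasks available")
```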

Usage

```python
import mteb
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "average_word_embeddings_komninos"
# or directly from huggingface:
# model_name = "sentence-transformers/all-MiniLM-L6-v2"

model = SentenceTransformer(model_name)
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")
```
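
Each task's scores are written as a JSON file under the output folder. A quick way to peek at what a run produced is sketched below; the exact directory layout and JSON keys vary between mteb versions, so treat the key access as an assumption to adapt:

```python
# Inspect the result files written by evaluation.run(); the JSON layout depends
# on the mteb version, so adjust what you print to match your files.
import json
from pathlib import Path

results_dir = Path("results/average_word_embeddings_komninos")  # the output_folder used above
for path in sorted(results_dir.glob("**/*.json")):
    data = json.loads(path.read_text())
    print(path.name, "->", list(data)[:5])  # show the first few top-level keys
```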

The package also provides a command line interface. To list the available tasks:

```bash
mteb available_tasks
```

To run an evaluation:

```bash
mteb run -m sentence-transformers/all-MiniLM-L6-v2 \
    -t Banking77Classification \
    --verbosity 3

# if nothing is specified, the results are saved to the results/{model_name} folder by default
```


## Advanced Usage

### Dataset selection

Datasets can be selected by providing the list of datasets, but also

* by their task (e.g. "Clustering" or "Classification")

```python
tasks = mteb.get_tasks(task_types=["Clustering", "Retrieval"])  # Only select clustering and retrieval tasks
```

* by their categories, e.g. "s2s" (sentence to sentence) or "p2p" (paragraph to paragraph)

```python
tasks = mteb.get_tasks(categories=["s2s", "p2p"])  # Only select sentence2sentence and paragraph2paragraph datasets
```

* by their languages

```python
tasks = mteb.get_tasks(languages=["eng", "deu"])  # Only select datasets which contain "eng" or "deu" (ISO 639-3 codes)
```

You can also specify which languages to load for multilingual/cross-lingual tasks like below:

```python
import mteb

tasks = [
    mteb.get_task("AmazonReviewsClassification", languages=["eng", "fra"]),
    mteb.get_task("BUCCBitextMining", languages=["deu"]),  # all subsets containing "deu"
]

# or you can select specific huggingface subsets like this:
from mteb.tasks import AmazonReviewsClassification, BUCCBitextMining

evaluation = mteb.MTEB(tasks=[
    AmazonReviewsClassification(hf_subsets=["en", "fr"]),  # Only load "en" and "fr" subsets of Amazon Reviews
    BUCCBitextMining(hf_subsets=["de-en"]),  # Only load the "de-en" subset of BUCC
])
# for an example of a HF subset see "Subset" in the dataset viewer at: https://huggingface.co/datasets/mteb/bucc-bitext-mining
```

There are also presets available for certain task collections, e.g. to select the 56 English datasets that form the "Overall MTEB English leaderboard":

```python
from mteb import MTEB_MAIN_EN
evaluation = mteb.MTEB(tasks=MTEB_MAIN_EN, task_langs=["en"])
```

### Evaluation split

You can evaluate only on the `test` splits of all tasks by doing the following:

```python
evaluation.run(model, eval_splits=["test"])
```

Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used.

### Using a custom model

Models should implement the following interface: an `encode` function that takes a list of sentences as input and returns a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.). For inspiration, you can look at the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts) used for running diverse models via SLURM scripts for the paper.

```python
from typing import Any

import numpy as np
import torch

import mteb


class MyModel():
    def encode(self, sentences: list[str], **kwargs: Any) -> torch.Tensor | np.ndarray:
        """Encodes the given sentences using the encoder.

        Args:
            sentences: The sentences to encode.
            **kwargs: Additional arguments to pass to the encoder.

        Returns:
            The encoded sentences.
        """
        pass


model = MyModel()
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model)
```

If you'd like to use different encoding functions for query and corpus when evaluating on Retrieval or Reranking tasks, you can add separate methods for `encode_queries` and `encode_corpus`. If these methods exist, they will be automatically used for those tasks. You can refer to the `DRESModel` at `mteb/evaluation/evaluators/RetrievalEvaluator.py` for an example of these functions.

```python
class MyModel():
    def encode_queries(self, queries: list[str], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
        """Returns a list of embeddings for the given queries.

        Args:
            queries: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        pass

    def encode_corpus(self, corpus: list[str] | list[dict[str, str]], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
        """Returns a list of embeddings for the given corpus.

        Args:
            corpus: List of sentences to encode,
                or list of dictionaries with keys "title" and "text"

        Returns:
            List of embeddings for the given sentences
        """
        pass
```

### Evaluating on a custom dataset

To evaluate on a custom task, you can run the following code with your custom task class. See [how to add a new task](docs/adding_a_dataset.md) for how to create a new task in MTEB.

```python
from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer


class MyCustomTask(AbsTaskReranking):
    ...


model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MyCustomTask()])
evaluation.run(model)
```
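
Putting the custom-model interface above into practice, the sketch below wraps a Sentence Transformers model and adds separate query/passage handling. The `"query: "`/`"passage: "` prefixes and the default model name are illustrative assumptions rather than anything MTEB requires, and only `batch_size` is forwarded from the keyword arguments:

```python
# Illustrative custom model implementing encode / encode_queries / encode_corpus.
# The prefixes and the underlying model are assumptions made for this sketch.
import numpy as np
from sentence_transformers import SentenceTransformer


class PrefixedSentenceTransformer:
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # Forward only batch_size; other kwargs passed by MTEB are ignored here.
        return self.model.encode(sentences, batch_size=kwargs.get("batch_size", 32))

    def encode_queries(self, queries: list[str], **kwargs) -> np.ndarray:
        return self.encode([f"query: {q}" for q in queries], **kwargs)

    def encode_corpus(self, corpus: list[str] | list[dict[str, str]], **kwargs) -> np.ndarray:
        # Corpus entries may be plain strings or {"title": ..., "text": ...} dictionaries.
        texts = [
            doc if isinstance(doc, str) else f"{doc.get('title', '')} {doc['text']}".strip()
            for doc in corpus
        ]
        return self.encode([f"passage: {t}" for t in texts], **kwargs)
```

Such an object can be passed to `evaluation.run(...)` exactly like the models in the examples above.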


Documentation

| Documentation | |
| ------------- | --- |
| 📋 Tasks | Overview of available tasks |
| 📈 Leaderboard | The interactive leaderboard of the benchmark |
| 🤖 Adding a model | Information related to how to submit a model to the leaderboard |
| 👩‍🔬 Reproducible workflows | Information related to how to reproduce and create reproducible workflows with MTEB |
| 👩‍💻 Adding a dataset | How to add a new task/dataset to MTEB |
| 👩‍💻 Adding a leaderboard tab | How to add a new leaderboard tab to MTEB |
| 🤝 Contributing | How to contribute to MTEB and set it up for development |
| 🌐 MMTEB | An open-source effort to extend MTEB to cover a broad set of languages |

Citing

MTEB was introduced in "MTEB: Massive Text Embedding Benchmark"; feel free to cite:

@article{muennighoff2022mteb,
  doi = {10.48550/ARXIV.2210.07316},
  url = {https://arxiv.org/abs/2210.07316},
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  publisher = {arXiv},
  journal = {arXiv preprint arXiv:2210.07316},
  year = {2022}
}

You may also want to read and cite the amazing work that has extended MTEB and integrated new datasets.

Works that have used MTEB for benchmarking can be found on the leaderboard.