McGill-NLP / llm2vec

Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
https://mcgill-nlp.github.io/llm2vec/
MIT License

MTEB Evaluation Running Time #140

Open stefanhgm opened 3 months ago

stefanhgm commented 3 months ago

Hi everyone!

Thanks for developing LLM2Vec and making the source code available.

I was trying to reproduce LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised and to train a model based on Llama 3.1 8B. I trained both models and now want to obtain results on the MTEB benchmark for comparison. Unfortunately, running the benchmark with the LLM2Vec models seems to take a very long time. I am currently done with the tasks CQADupstackWordpressRetrieval and ClimateFever (also see #135), and the next task (I think it is DBPedia) is taking over 48 hours on a single A100 80GB. Is this the expected behavior? Can you share some insights about the running times of LLM2Vec on MTEB, or advice on how to speed it up?

I use the below snippet to run MTEB based on the script you provided:

    model = mteb.get_model(args.model_name, **model_kwargs)
    tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
    evaluation = mteb.MTEB(tasks=tasks)
    results = evaluation.run(model, output_folder=args.output_dir)

Thanks for any help!

vaibhavad commented 2 months ago

Hi @stefanhgm,

Yes, unfortunately evaluating 7B models on MTEB is an extremely long and arduous process. The only thing that can help speed up the evaluation is a multi-GPU setup, if one is available.

The library supports multi-GPU evaluation without any code changes.
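
For reference, here is a minimal sketch of how one might pin the run to specific GPUs and verify what the process actually sees. It assumes the model wrapper picks up every visible CUDA device on its own, as described above, and it omits the task instructions for brevity:

    # Minimal sketch: select GPUs and check visibility before launching the run.
    # Assumes the mteb model wrapper uses all visible CUDA devices without
    # further changes; STS16 is used only to keep the example small.
    import os

    # Must be set before CUDA is initialized, i.e. before importing torch.
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")

    import torch
    import mteb

    print("Visible GPUs:", torch.cuda.device_count())

    model = mteb.get_model("McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised")
    tasks = mteb.get_tasks(tasks=["STS16"], languages=["eng"])
    mteb.MTEB(tasks=tasks).run(model, output_folder="results")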

stefanhgm commented 2 months ago

Hi @vaibhavad,

Thanks for getting back to me on this! In my experience on 4 GPUs, it only gets ~2.5x faster. Could you give me an estimate of the overall running time, or the time you needed for DBPedia, if that's available?

Otherwise I will just try it again, allowing more time or using more GPUs. Thank you!

vaibhavad commented 2 months ago

Unfortunately, I don't remember the running time of DBPedia and I don't have the log files anymore. However, I do remember that out of all tasks, MSMARCO took the longest, at 7 hours on 8 A100 GPUs. So DBPedia should take less than that.

vaibhavad commented 2 months ago

Hi @stefanhgm,

I just ran the DBPedia evaluation for the Llama 3.1 8B model; it took 2.5 hours on 8 H100 80GB GPUs.

stefanhgm commented 2 months ago

Thank you! That was helpful.

stefanhgm commented 2 months ago

Hi @vaibhavad,

Sorry, I stumbled across another issue: do we actually have to run the tasks on the train, dev, and test sets (as is done by default), or does the test set suffice? It seems the test scores are what is uploaded to the leaderboard, and running only those would drastically reduce the running time.

I use the following code snippet to run the MTEB benchmark in mteb_eval.py:

    tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
    evaluation = mteb.MTEB(tasks=tasks)
    results = evaluation.run(model, verbosity=2, output_folder=args.output_dir)

I looked for a way to filter for the test splits only, but I did not find a straightforward way to do it. What approach did you use to create the results for the MTEB leaderboard?

Thank you!

stefanhgm commented 1 month ago

I am now trying it with the following code, using only the test sets:

    tasks_orig = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
    # Remove MSMARCO because it is evaluated on dev set
    tasks = [t for t in tasks_orig if "MSMARCO" not in t.metadata.name]
    evaluation = mteb.MTEB(tasks=tasks)
    # Only run on test set for leaderboard, exception: MSMARCO manually on dev set
    results = evaluation.run(model, eval_splits=["test"], verbosity=2, output_folder=args.output_dir)
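
MSMARCO can then be scored separately on its dev split along the same lines (a sketch, not verified end to end):

    # MSMARCO is the one leaderboard task scored on the "dev" split,
    # so it is run on its own after the test-split run above.
    msmarco_tasks = mteb.get_tasks(tasks=["MSMARCO"], languages=["eng"])
    msmarco_eval = mteb.MTEB(tasks=msmarco_tasks)
    msmarco_results = msmarco_eval.run(
        model, eval_splits=["dev"], verbosity=2, output_folder=args.output_dir
    )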

nasosger commented 1 month ago

Hi @stefanhgm, I am interested in the MTEB evaluation of custom llm2vec models. I have trained a Gemma 2B model with the bi-mntp-simcse setting, but I cannot reproduce the evaluation script for the custom model. Could you provide some details on the modifications you had to make to the MTEB source code? I imagine the same modifications should work for every custom model, given the correct versions of the libraries.

vaibhavad commented 1 month ago

Hi @stefanhgm ,

> Do we actually have to run the tasks on the train, dev, and test sets (as is done by default), or does the test set suffice?

Just the test set suffices. I believe you have already figured out a way to run on just the dev/test splits with the MTEB package. Let me know if you need anything else.

stefanhgm commented 2 days ago

Hi @nasosger,

Sorry for the very late reply. I basically changed the things I pointed out earlier. Here is my mteb_eval.py:

import argparse
import mteb
from mteb.benchmarks import MTEB_MAIN_EN
import json

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name",
        type=str,
        default="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    )
    parser.add_argument("--task_name", type=str, default="STS16")
    parser.add_argument("--task_types", type=str, default="")
    parser.add_argument("--do_mteb_main_en", action="store_true", default=False)
    parser.add_argument(
        "--task_to_instructions_fp",
        type=str,
        default="test_configs/mteb/task_to_instructions.json",
    )
    parser.add_argument("--output_dir", type=str, default="results")

    args = parser.parse_args()

    model_kwargs = {}
    if args.task_to_instructions_fp is not None:
        with open(args.task_to_instructions_fp, "r") as f:
            task_to_instructions = json.load(f)
        model_kwargs["task_to_instructions"] = task_to_instructions

    model = mteb.get_model(args.model_name, **model_kwargs)

    if args.do_mteb_main_en:
        tasks_orig = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
        # Remove MSMARCO
        # "Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used."
        # See: https://github.com/embeddings-benchmark/mteb
        tasks = [t for t in tasks_orig if "MSMARCO" not in t.metadata.name]
        assert len(tasks_orig) == 67 and len(tasks) == 66
    elif args.task_types:
        tasks = mteb.get_tasks(task_types=[args.task_types], languages=["eng"])
    else:
        tasks = mteb.get_tasks(tasks=[args.task_name], languages=["eng"])

    evaluation = mteb.MTEB(tasks=tasks)

    # Set logging to debug
    mteb.logger.setLevel(mteb.logging.DEBUG)
    # Only run on test set for leaderboard, exception: MSMARCO manually on dev set
    results = evaluation.run(model, eval_splits=["test"], verbosity=2, output_folder=args.output_dir)
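
As a side note, the script can be run with, e.g., python mteb_eval.py --do_mteb_main_en (plus --model_name and --output_dir as needed). For a custom checkpoint like your Gemma 2B model, mteb.get_model may not resolve the name, since it only knows registered models. A rough, untested sketch of an alternative is to load the checkpoint with llm2vec directly and hand the evaluation a minimal wrapper that exposes an encode method (paths are placeholders, and instruction handling is omitted):

    # Rough, untested sketch (not the authors' setup): evaluate an unregistered
    # llm2vec checkpoint by wrapping it in an object exposing `encode`,
    # which is what the mteb evaluation loop calls. Paths are placeholders.
    import mteb
    import torch
    from llm2vec import LLM2Vec

    class CustomLLM2Vec:
        def __init__(self, base_model_path, peft_model_path):
            self.model = LLM2Vec.from_pretrained(
                base_model_path,
                peft_model_name_or_path=peft_model_path,
                device_map="cuda" if torch.cuda.is_available() else "cpu",
                torch_dtype=torch.bfloat16,
            )

        def encode(self, sentences, **kwargs):
            # llm2vec returns a torch tensor; convert to a float numpy array for mteb.
            return self.model.encode(sentences).float().cpu().numpy()

    model = CustomLLM2Vec("path/to/gemma-2b-bi-mntp", "path/to/gemma-2b-simcse-checkpoint")
    tasks = mteb.get_tasks(tasks=["STS16"], languages=["eng"])
    mteb.MTEB(tasks=tasks).run(model, eval_splits=["test"], output_folder="results")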