stefanhgm opened 3 months ago
Hi @stefanhgm,
Yes, unfortunately evaluating 7B models on MTEB is an extremely long and arduous process. The only thing that can help speed up the evaluation is a multi-GPU setup, in case that is available.
The library supports multi-GPU evaluation without any code changes.
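For reference, a minimal sketch of what such a run could look like, assuming the model wrapper automatically spreads the encoding work across every GPU visible to the process; the GPU ids, the DBPedia task choice, and the omission of task instructions are illustrative simplifications, not taken from this thread:
# Minimal sketch: select the GPUs before anything CUDA-related is initialized.
import os

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")  # hypothetical GPU selection

import torch
import mteb

print(f"Visible GPUs: {torch.cuda.device_count()}")

# The evaluation code itself stays unchanged; multi-GPU use is handled by the
# model wrapper (assumption based on "without any code changes" above).
model = mteb.get_model("McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised")
tasks = mteb.get_tasks(tasks=["DBPedia"], languages=["eng"])
results = mteb.MTEB(tasks=tasks).run(model, eval_splits=["test"], verbosity=2)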
Hi @vaibhavad,
Thanks for getting back to me on this! My experience on 4 GPUs is that it only gets ~2.5x faster. Can you maybe give me an estimate of the overall running time, or the time you needed to run DBPedia, if that's available?
Otherwise I will just try it again with a longer time interval or more GPUs. Thank you!
Unfortunately, I don't remember the running time of DBPedia and I don't have the log files anymore. However, I do remember that out of all tasks, MSMARCO took the longest, which was 7 hours on 8 A100 GPUs. So DBPedia will be less than that.
Hi @stefanhgm,
I just ran the DBPedia evaluation for the Llama 3.1 8B model; it took 2.5 hours on 8 x H100 80GB GPUs.
Thank you! That was helpful.
Hi @vaibhavad,
sorry, I stumbled across another issue: Do we actually have to run the tasks on the train, dev, and test splits (as is done by default), or does the test split suffice? It seems only the test scores are the ones uploaded to the leaderboard, and skipping the other splits would drastically reduce the running time.
I use the following code snippet to run the MTEB benchmark in mteb_eval.py:
import mteb
from mteb.benchmarks import MTEB_MAIN_EN

tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, verbosity=2, output_folder=args.output_dir)
I looked for alternatives to filter only for the test datasets, but I did not find a straightforward way to do it. What approach did you use for creating the results for the MTEB leaderboard?
Thank you!
I am now trying it with the following code, using only the test sets:
import mteb
from mteb.benchmarks import MTEB_MAIN_EN

tasks_orig = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
# Remove MSMARCO because the leaderboard evaluates it on the dev set
tasks = [t for t in tasks_orig if "MSMARCO" not in t.metadata.name]
evaluation = mteb.MTEB(tasks=tasks)
# Only run on the test set for the leaderboard; exception: MSMARCO is run manually on the dev set
results = evaluation.run(model, eval_splits=["test"], verbosity=2, output_folder=args.output_dir)
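For completeness, a hedged sketch of the separate MSMARCO run on its dev split; it only reuses the get_tasks/MTEB/run calls shown above and is not copied from the leaderboard scripts:
# Evaluate MSMARCO separately on its dev split, as the public leaderboard does
msmarco_tasks = mteb.get_tasks(tasks=["MSMARCO"], languages=["eng"])
msmarco_evaluation = mteb.MTEB(tasks=msmarco_tasks)
msmarco_results = msmarco_evaluation.run(
    model, eval_splits=["dev"], verbosity=2, output_folder=args.output_dir
)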
Hi @stefanhgm, I am interested in the LLM2Vec MTEB evaluation of custom models. I have trained a Gemma 2B with the bi-mntp-simcse setting, but cannot reproduce the evaluation script for the custom model. Could you provide some details on the modifications you had to make to the MTEB source code? I imagine that for every custom model the same modifications should work, given the correct versions of the libraries.
Hi @stefanhgm,
Do we actually have to run the tasks for the train, dev and test sets (as it is done by default) or does test suffice?
Just the test split suffices. I believe you have already figured out a way to run on just the dev/test splits with the MTEB package. Let me know if you need anything else.
Hi @nasosger,
sorry for the very late reply. I basically changed the things I pointed out earlier. Here is my mteb_eval.py:
import argparse
import json

import mteb
from mteb.benchmarks import MTEB_MAIN_EN

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name",
        type=str,
        default="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised",
    )
    parser.add_argument("--task_name", type=str, default="STS16")
    parser.add_argument("--task_types", type=str, default="")
    parser.add_argument("--do_mteb_main_en", action="store_true", default=False)
    parser.add_argument(
        "--task_to_instructions_fp",
        type=str,
        default="test_configs/mteb/task_to_instructions.json",
    )
    parser.add_argument("--output_dir", type=str, default="results")
    args = parser.parse_args()

    model_kwargs = {}
    if args.task_to_instructions_fp is not None:
        with open(args.task_to_instructions_fp, "r") as f:
            task_to_instructions = json.load(f)
        model_kwargs["task_to_instructions"] = task_to_instructions

    model = mteb.get_model(args.model_name, **model_kwargs)

    if args.do_mteb_main_en:
        tasks_orig = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
        # Remove MSMARCO
        # "Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used."
        # See: https://github.com/embeddings-benchmark/mteb
        tasks = [t for t in tasks_orig if "MSMARCO" not in t.metadata.name]
        assert len(tasks_orig) == 67 and len(tasks) == 66
    elif args.task_types:
        tasks = mteb.get_tasks(task_types=[args.task_types], languages=["eng"])
    else:
        tasks = mteb.get_tasks(tasks=[args.task_name], languages=["eng"])

    evaluation = mteb.MTEB(tasks=tasks)

    # Set logging to debug
    mteb.logger.setLevel(mteb.logging.DEBUG)

    # Only run on the test set for the leaderboard, exception: MSMARCO manually on dev set
    results = evaluation.run(
        model, eval_splits=["test"], verbosity=2, output_folder=args.output_dir
    )
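To compare the finished runs afterwards, a small hedged helper like the one below can collect the JSON files that evaluation.run writes into the output folder; the exact file layout and keys are assumptions and may differ between mteb versions:
import glob
import json
import os

output_dir = "results"  # same value as --output_dir above

# Walk every JSON result file under the output folder (layout is an assumption).
for path in sorted(glob.glob(os.path.join(output_dir, "**", "*.json"), recursive=True)):
    with open(path) as f:
        res = json.load(f)
    # Result files typically carry the task name and per-split scores (assumption).
    print(path, res.get("task_name", os.path.basename(path)))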
Hi everyone!
Thanks for developing LLM2Vec and making the source code available.
I was trying to reproduce LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised and to train a model based on Llama 3.1 8B. I trained both models and now want to obtain results on the MTEB benchmark for comparison. Unfortunately, it seems to take very long to run the benchmark using the LLM2Vec models. I am currently done with the tasks CQADupstackWordpressRetrieval and ClimateFever (also see #135), and the next task (I think it is DBPedia) takes over 48h on a single A100 80GB. Is this the expected behavior? Can you share some insights about the running times of LLM2Vec on MTEB or share advice on how to speed it up? I use the below snippet to run MTEB based on the script you provided:
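(Presumably this is the same snippet quoted earlier in the thread; reproduced here as a sketch, with imports added, where model and args are set up elsewhere in mteb_eval.py:)
import mteb
from mteb.benchmarks import MTEB_MAIN_EN

tasks = mteb.get_tasks(tasks=MTEB_MAIN_EN.tasks, languages=["eng"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, verbosity=2, output_folder=args.output_dir)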
Thanks for any help!