KennethEnevoldsen opened 2 months ago
This segment is intended to aggregate the performance gains over our changes in a reasonable format.
Related to: #836, #835
@imenelydiaker I could imagine that you might be interested in this?
Yes I'm interested in this, happy to collaborate with any person who's interested!
Perfect @imenelydiaker, I will assign you; feel free to recruit collaborators. @mrshu might already have done some of the work needed for this task.
Hey @mrshu! Did you finish running the English benchmark?
@imenelydiaker unfortunately, it's still running. With sentence-t5-xxl
some of the datasets take days to evaluate (e.g. FEVER or MSMARCO). I am running this on a single H100 and didn't really look into running it in parallel, but I am afraid I am running out of ideas otherwise -- the model is pretty big and encoding a single batch of 391 samples takes roughly 55 minutes.
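As a rough sanity check on those numbers (assuming the 391-sample batch and the ~55-minute figure above are representative), the implied throughput works out to well under one sample per second:

```python
# Back-of-envelope throughput for sentence-t5-xxl on a single H100,
# using the figures mentioned above: 391 samples per batch, ~55 min/batch.
batch_size = 391
minutes_per_batch = 55

samples_per_second = batch_size / (minutes_per_batch * 60)
samples_per_hour = samples_per_second * 3600

print(f"{samples_per_second:.3f} samples/s")  # roughly 0.118 samples/s
print(f"{samples_per_hour:.0f} samples/hour")
```

At that rate, corpora with millions of documents (like MSMARCO) plausibly take days, which matches the observation above.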
Might be worth running it with a smaller model? Or at least starting the run on the new English benchmark.
Yeah, we should definitely use a smaller model. I'd go for one of the baseline models we used, multilingual-e5-small,
to be consistent with the rest of the evaluations and the paper in general.
@KennethEnevoldsen @mrshu wdyt?
Yea - a larger model like e5-large would produce a larger effect, so I would probably go for that, but a 7B or larger model might be too frustrating to run.
Ok then let's go for multilingual-e5-large.
For the fast version of MTEB, I wonder if it's not worth waiting for #836 to complete? All Clustering tasks have been converted to Fast for the moment, but it would be better to have a faster version of retrieval as well.
@mrshu would you be able to run the experiments again on the English benchmark with multilingual-e5-large? We should start with the old English benchmark only, and wait a little before starting the fast version (just wait a few weeks to see if #836 completes, or ignore retrieval and focus on clustering + classification downsampling only).
@imenelydiaker I am on it :)
@mrshu you will have to make sure that you use the correct implementation of e5 - I implemented e5-large in #876.
@KennethEnevoldsen ugh, is there a chance just running the code in https://github.com/embeddings-benchmark/mteb/commit/4e1bab4ef964291aadacf808465d54e93d3db4cc will save wrong data?
The e5 model will not produce the same results, as the model expects a specially formatted input, so it should perform slightly better with the PR (especially on retrieval).
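For context, the E5 model family expects every input to carry a "query: " or "passage: " prefix before encoding; without these prefixes retrieval scores degrade. A minimal sketch of that formatting (the helper names here are illustrative, not mteb's actual API):

```python
# E5-style input formatting: queries and passages get distinct prefixes.
# Function names are illustrative, not mteb's actual interface.
def format_query(text: str) -> str:
    return f"query: {text}"

def format_passage(text: str) -> str:
    return f"passage: {text}"

queries = [format_query(q) for q in ["what is machine learning?"]]
passages = [format_passage(p) for p in ["Machine learning is a field of AI."]]
print(queries[0])   # query: what is machine learning?
print(passages[0])  # passage: Machine learning is a field of AI.
```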
Thanks @KennethEnevoldsen.
The processing has failed with the following error anyhow:
INFO:mteb.evaluation.evaluators.RetrievalEvaluator:For evaluation, we ignore identical query and document ids (default), please explicitly set ``ignore_identical_ids=False`` to ignore this.
ERROR:mteb.evaluation.MTEB:Error while evaluating ArguAna: Expected object_relevance_per_qid dictionary and measures set.
Traceback (most recent call last):
File "/home/mrshu/mteb/scripts/run_mteb_english.py", line 115, in <module>
evaluation.run(model, output_folder=f"carbon_results/{model_name}", eval_splits=eval_splits, co2_tracker=True)
File "/home/mrshu/mteb/mteb/evaluation/MTEB.py", line 319, in run
raise e
File "/home/mrshu/mteb/mteb/evaluation/MTEB.py", line 297, in run
results, tick, tock = self._run_eval(task, model, split, output_folder, **kwargs)
File "/home/mrshu/mteb/mteb/evaluation/MTEB.py", line 216, in _run_eval
results = task.evaluate(model, split, output_folder=output_folder, **kwargs)
File "/home/mrshu/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 165, in evaluate
scores = self._evaluate_monolingual(retriever, corpus, queries, relevant_docs, None, **kwargs)
File "/home/mrshu/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 191, in _evaluate_monolingual
ndcg, _map, recall, precision = retriever.evaluate(relevant_docs, results, retriever.k_values, ignore_identical_ids=kwargs.get("ignore_identical_ids", True))
File "/home/mrshu/mteb/mteb/evaluation/evaluators/RetrievalEvaluator.py", line 207, in evaluate
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {map_string, ndcg_string, recall_string, precision_string})
File "/home/mrshu/mteb/.v/lib/python3.10/site-packages/pytrec_eval/__init__.py", line 59, in __init__
super().__init__(query_relevance=query_relevance, measures=measures, relevance_level=relevance_level, judged_docs_only_flag=judged_docs_only_flag)
TypeError: Expected object_relevance_per_qid dictionary and measures set.
Would it make sense to merge your changes in and re-run the whole pipeline again?
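For reference, `pytrec_eval.RelevanceEvaluator` is a C extension, and this exact `TypeError` appears to be raised when the qrels argument is not a plain dict-of-dicts or the measures argument is not a set. A stdlib-only sketch of the shapes it expects (no pytrec_eval import, just the structure):

```python
# Shapes pytrec_eval.RelevanceEvaluator expects (sketch, stdlib only):
#   qrels:    dict[query_id -> dict[doc_id -> int relevance]]
#   measures: a set of measure strings, e.g. {"map", "ndcg_cut.10"}
# Passing anything else (e.g. qrels as a list, or measures as a list)
# seems to trigger:
#   TypeError: Expected object_relevance_per_qid dictionary and measures set.
qrels = {
    "q1": {"d1": 1, "d2": 0},
    "q2": {"d3": 1},
}
measures = {"map", "ndcg_cut.10", "recall_100"}

assert isinstance(qrels, dict)
assert all(isinstance(docs, dict) for docs in qrels.values())
assert isinstance(measures, set)
```

So one thing worth checking is what types the old branch actually passes into the evaluator.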
@imenelydiaker are you familiar with this error? (probably related to #833)
@mrshu will you create an issue for this? I think #833 and #876 need to be merged beforehand.
Nope, haven't seen this error before. #833 has not been merged yet, so it's not related imo. It may be related to MIRACL Reranking #830 and Abstention #854; these are the latest PRs that updated the Retrieval evaluator.
Hmm couldn't reproduce the error on the latest branch:
from sentence_transformers import SentenceTransformer
import mteb

tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluation = mteb.MTEB(tasks=tasks)
model = SentenceTransformer("all-MiniLM-L6-v2")
evaluation.run(model, output_folder="temp")  # type: ignore
@KennethEnevoldsen that's the tough part -- this is running on a very old branch intentionally (the code in https://github.com/embeddings-benchmark/mteb/commit/4e1bab4ef964291aadacf808465d54e93d3db4cc builds off of the 1.2.0 release), to show the differences from before the benchmark was optimized to run faster and use less CO2.
This didn't cause an issue for me previously with other models, so I wonder whether it might be multilingual-e5-large that is causing trouble here?
In any case, any suggestions would be greatly appreciated!
Ahh right! Sorry, yeah. You should still implement encode_corpus and encode_queries for the e5 models (or we could potentially choose a model that does not require a custom implementation -- something in the bert-base to bert-large range, which is somewhat standard). There might be a bug in that specific version; can you run a small model on ArguAna just to check?
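A minimal sketch of the kind of wrapper being suggested here (the class name and the delegation to a generic `encode` callable are illustrative assumptions, not the actual code from #876); older mteb retrieval evaluators call `encode_queries`/`encode_corpus` on the model when those methods exist:

```python
# Sketch of an E5 wrapper exposing the encode_queries / encode_corpus
# hooks that mteb's retrieval evaluator looks for. The `encode` callable
# stands in for SentenceTransformer.encode; the wrapper is illustrative,
# not the actual implementation from #876.
class E5Wrapper:
    def __init__(self, encode):
        self._encode = encode  # e.g. SentenceTransformer(...).encode

    def encode_queries(self, queries, **kwargs):
        return self._encode([f"query: {q}" for q in queries], **kwargs)

    def encode_corpus(self, corpus, **kwargs):
        # mteb passes corpus entries as dicts with "title"/"text" fields.
        texts = [
            "passage: " + " ".join(
                part for part in (doc.get("title", ""), doc["text"]) if part
            )
            for doc in corpus
        ]
        return self._encode(texts, **kwargs)

# Identity encoder so the sketch runs without sentence-transformers.
wrapper = E5Wrapper(encode=lambda texts: texts)
print(wrapper.encode_queries(["what is mteb?"]))  # ['query: what is mteb?']
```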
@Muennighoff you might be more aware of this bug than me
Also haven't seen that bug before - maybe an issue with the pytrec_eval version you have installed? Maybe make sure to install whatever was in the requirements.txt file at that point in time, not what's there now? 🤔
Thanks @Muennighoff and @KennethEnevoldsen. Unfortunately, pytrec_eval hasn't been updated in the past 3 years (https://pypi.org/project/pytrec-eval/), so it doesn't seem like it will be that.
I'll try the smaller model just to check.
The sentence-t5-xxl processing has finally finished -- the results can be seen in https://github.com/embeddings-benchmark/mteb/compare/main...mrshu:mteb:mrshu/port-carbon-emissions-estimation