KennethEnevoldsen opened 2 months ago
This segment is intended to aggregate the performance gains over our changes in a reasonable format.
Related to: #836, #835
@imenelydiaker I could imagine that you might be interested in this?
Yes I'm interested in this, happy to collaborate with any person who's interested!
Perfect @imenelydiaker, I will assign you; feel free to recruit collaborators. @mrshu might already have done some of the work needed for this task.
Hey @mrshu! Did you finish running the English benchmark?
@imenelydiaker unfortunately, it's still running. With sentence-t5-xxl
some of the datasets take days to evaluate (e.g. FEVER or MSMARCO). I am running this on a single H100 and didn't really look into running it in parallel, but I am afraid I am running out of ideas otherwise -- the model is pretty big and encoding a single batch of 391 samples takes roughly 55 minutes.
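As a rough sanity check on those numbers (assuming the 391-sample batch and the ~55-minute figure above are representative), the implied throughput works out to well under one sample per second:

```python
# Back-of-envelope throughput for sentence-t5-xxl on a single H100,
# using the figures mentioned above: 391 samples per batch, ~55 min/batch.
batch_size = 391
minutes_per_batch = 55

samples_per_second = batch_size / (minutes_per_batch * 60)
samples_per_hour = samples_per_second * 3600

print(f"{samples_per_second:.3f} samples/s")  # roughly 0.118 samples/s
print(f"{samples_per_hour:.0f} samples/hour")
```

At that rate, corpora with millions of documents (like MSMARCO) plausibly take days, which matches the observation above.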
Might be worth running it with a smaller model? Or at least starting the run on the new English benchmark.
Yeah, we should definitely use a smaller model. I'd go for one of the baseline models we used, multilingual-e5-small,
to be consistent with the rest of the evaluations and the paper in general.
@KennethEnevoldsen @mrshu wdyt?
Yea - a larger model like e5-large would produce a larger effect, so I would probably go for that, but a 7B or larger model might be too frustrating to run.
Ok then let's go for multilingual-e5-large.
For the fast version of MTEB, I wonder if it's not worth waiting for #836 to complete? All Clustering tasks have been converted to Fast for the moment, but it would be better to have a faster version of retrieval as well.
@mrshu would you be able to run the experiments again on the English benchmark with multilingual-e5-large? We should start with the old English benchmark only, and wait a little before starting the fast version (just wait a few weeks to see if #836 completes, or ignore retrieval and focus on clustering + classification downsampling only).
@imenelydiaker I am on it :)
@mrshu you will have to make sure that you use the correct implementation of e5 - I implemented e5-large in #876.
@KennethEnevoldsen ugh, is there a chance just running the code in https://github.com/embeddings-benchmark/mteb/commit/4e1bab4ef964291aadacf808465d54e93d3db4cc will save wrong data?
The e5 model will not produce the same results, as the model expects a specially formatted input, so it should perform slightly better with the PR (especially on retrieval).
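For context, the E5 model family expects every input to carry a "query: " or "passage: " prefix before encoding; without these prefixes retrieval scores degrade. A minimal sketch of that formatting (the helper names here are illustrative, not mteb's actual API):

```python
# E5-style input formatting: queries and passages get distinct prefixes.
# Function names are illustrative, not mteb's actual interface.
def format_query(text: str) -> str:
    return f"query: {text}"

def format_passage(text: str) -> str:
    return f"passage: {text}"

queries = [format_query(q) for q in ["what is machine learning?"]]
passages = [format_passage(p) for p in ["Machine learning is a field of AI."]]
print(queries[0])   # query: what is machine learning?
print(passages[0])  # passage: Machine learning is a field of AI.
```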
Thanks @KennethEnevoldsen.
The processing has failed with the following error anyhow:
INFO:mteb.evaluation.evaluators.RetrievalEvaluator:For evaluation, we ignore identical query and document ids (default), please explicitly set ``ignore_identical_ids=False`` to ignore this.
ERROR:mteb.evaluation.MTEB:Error while evaluating ArguAna: Expected object_relevance_per_qid dictionary and measures set.
Traceback (most recent call last):
File "/home/mrshu/mteb/scripts/run_mteb_english.py", line 115, in <module>
evaluation.run(model, output_folder=f"carbon_results/{model_name}", eval_splits=eval_splits, co2_tracker=True)
File "/home/mrshu/mteb/mteb/evaluation/MTEB.py", line 319, in run
raise e
File "/home/mrshu/mteb/mteb/evaluation/MTEB.py", line 297, in run
results, tick, tock = self._run_eval(task, model, split, output_folder, **kwargs)
File "/home/mrshu/mteb/mteb/evaluation/MTEB.py", line 216, in _run_eval
results = task.evaluate(model, split, output_folder=output_folder, **kwargs)
File "/home/mrshu/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 165, in evaluate
scores = self._evaluate_monolingual(retriever, corpus, queries, relevant_docs, None, **kwargs)
File "/home/mrshu/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 191, in _evaluate_monolingual
ndcg, _map, recall, precision = retriever.evaluate(relevant_docs, results, retriever.k_values, ignore_identical_ids=kwargs.get("ignore_identical_ids", True))
File "/home/mrshu/mteb/mteb/evaluation/evaluators/RetrievalEvaluator.py", line 207, in evaluate
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {map_string, ndcg_string, recall_string, precision_string})
File "/home/mrshu/mteb/.v/lib/python3.10/site-packages/pytrec_eval/__init__.py", line 59, in __init__
super().__init__(query_relevance=query_relevance, measures=measures, relevance_level=relevance_level, judged_docs_only_flag=judged_docs_only_flag)
TypeError: Expected object_relevance_per_qid dictionary and measures set.
Would it make sense to merge your changes in and re-run the whole pipeline again?
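For reference, `pytrec_eval.RelevanceEvaluator` is a C extension, and this exact `TypeError` appears to be raised when the qrels argument is not a plain dict-of-dicts or the measures argument is not a set. A stdlib-only sketch of the shapes it expects (no pytrec_eval import, just the structure):

```python
# Shapes pytrec_eval.RelevanceEvaluator expects (sketch, stdlib only):
#   qrels:    dict[query_id -> dict[doc_id -> int relevance]]
#   measures: a set of measure strings, e.g. {"map", "ndcg_cut.10"}
# Passing anything else (e.g. qrels as a list, or measures as a list)
# seems to trigger:
#   TypeError: Expected object_relevance_per_qid dictionary and measures set.
qrels = {
    "q1": {"d1": 1, "d2": 0},
    "q2": {"d3": 1},
}
measures = {"map", "ndcg_cut.10", "recall_100"}

assert isinstance(qrels, dict)
assert all(isinstance(docs, dict) for docs in qrels.values())
assert isinstance(measures, set)
```

So one thing worth checking is what types the old branch actually passes into the evaluator.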
@imenelydiaker are you familiar with this error? (probably related to #833)
@mrshu will you create an issue for this? I think #833 and #876 need to be merged beforehand.
Nope, haven't seen this error before. #833 has not been merged yet, so it's not related imo. It may be related to MIRACL Reranking #830 and Abstention #854; these are the latest PRs that updated the Retrieval evaluator.
Hmm couldn't reproduce the error on the latest branch:
from sentence_transformers import SentenceTransformer
import mteb

tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluation = mteb.MTEB(tasks=tasks)
model = SentenceTransformer("all-MiniLM-L6-v2")
evaluation.run(model, output_folder="temp")  # type: ignore
@KennethEnevoldsen that's the tough part -- this is running on a very old branch intentionally (the code in https://github.com/embeddings-benchmark/mteb/commit/4e1bab4ef964291aadacf808465d54e93d3db4cc builds off of the 1.2.0 release), to show the differences from before the benchmark was optimized to run faster and use less CO2.
This didn't cause an issue for me previously with other models, so I wonder whether it might be multilingual-e5-large that is causing trouble here?
In any case, any suggestions would be greatly appreciated!
Ahh right! Sorry, yeah. You should still implement encode_corpus and encode_queries for the e5 models (or we could potentially choose a model that does not require a custom implementation -- something in the bert-base to bert-large range, which is somewhat standard). There might be a bug in that specific version; can you run a small model on ArguAna just to check?
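A minimal sketch of the kind of wrapper being suggested here (the class name and the delegation to a generic `encode` callable are illustrative assumptions, not the actual code from #876); older mteb retrieval evaluators call `encode_queries`/`encode_corpus` on the model when those methods exist:

```python
# Sketch of an E5 wrapper exposing the encode_queries / encode_corpus
# hooks that mteb's retrieval evaluator looks for. The `encode` callable
# stands in for SentenceTransformer.encode; the wrapper is illustrative,
# not the actual implementation from #876.
class E5Wrapper:
    def __init__(self, encode):
        self._encode = encode  # e.g. SentenceTransformer(...).encode

    def encode_queries(self, queries, **kwargs):
        return self._encode([f"query: {q}" for q in queries], **kwargs)

    def encode_corpus(self, corpus, **kwargs):
        # mteb passes corpus entries as dicts with "title"/"text" fields.
        texts = [
            "passage: " + " ".join(
                part for part in (doc.get("title", ""), doc["text"]) if part
            )
            for doc in corpus
        ]
        return self._encode(texts, **kwargs)

# Identity encoder so the sketch runs without sentence-transformers.
wrapper = E5Wrapper(encode=lambda texts: texts)
print(wrapper.encode_queries(["what is mteb?"]))  # ['query: what is mteb?']
```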
@Muennighoff you might be more aware of this bug than me
Also haven't seen that bug before - maybe an issue with the pytrec_eval version you have installed? Maybe make sure to install whatever was in the requirements.txt file at that point in time, not what's there now? 🤔
Thanks @Muennighoff and @KennethEnevoldsen. Unfortunately, pytrec_eval hasn't been updated in the past 3 years (https://pypi.org/project/pytrec-eval/), so it doesn't seem like it will be that.
I'll try the smaller model just to check.
The sentence-t5-xxl processing has finally finished -- the results can be seen in https://github.com/embeddings-benchmark/mteb/compare/main...mrshu:mteb:mrshu/port-carbon-emissions-estimation