embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Retrieval task loading HF dataset #173

Closed: SantiDianaClibrain closed this issue 4 months ago

SantiDianaClibrain commented 11 months ago

I have a question: are the BeIR datasets on HuggingFace the same version as the ones hosted at BeIR's proprietary URLs? Is it safe to flip `USE_HF_DATASETS` from `False` to `True`?

Muennighoff commented 11 months ago

Pretty sure they are the same - maybe @thakur-nandan can confirm

SantiDianaClibrain commented 11 months ago

Okay! I am having an issue. Let's see if you can give me some help. I am working with a fork of your repository.

I am trying to reproduce this code:

```python
from sentence_transformers import SentenceTransformer
from mteb import MTEB
from mteb.tasks import SCIDOCS

for model_name in models:
    model = SentenceTransformer(model_name)

    evaluation = MTEB(tasks=[SCIDOCS(langs=["en"])])
    evaluation.run(model, output_folder=f"results/{model_name}", eval_splits=["test"])
```

The problem arises when I want to use an HF dataset.

For that purpose, I go to `BeIRTask.py` and set `USE_HF_DATASETS = True`.

Then, I run the script and get this error:

```
INFO:mteb.evaluation.MTEB:

** Evaluating SCIDOCS **
INFO:mteb.evaluation.MTEB:Loading dataset for SCIDOCS
INFO:mteb.abstasks.BeIRTask:Using HFDataLoader for BeIR
WARNING:beir.datasets.data_loader_hf:A huggingface repository is provided. This will override the data_folder, prefix and *_file arguments.
INFO:beir.datasets.data_loader_hf:Loading Corpus...
INFO:beir.datasets.data_loader_hf:Loaded 25657 TEST Documents.
INFO:beir.datasets.data_loader_hf:Doc Example: {'id': '632589828c8b9fca2c3a59e97451fde8fa7d188d', 'title': 'A hybrid of genetic algorithm and particle swarm optimization for recurrent network design', 'text': 'An evolutionary recurrent network which automates the design of recurrent neural/fuzzy networks using a new evolutionary learning algorithm is proposed in this paper. This new evolutionary learning algorithm is based on a hybrid of genetic algorithm (GA) and particle swarm optimization (PSO), and is thus called HGAPSO. In HGAPSO, individuals in a new generation are created, not only by crossover and mutation operation as in GA, but also by PSO. The concept of elite strategy is adopted in HGAPSO, where the upper-half of the best-performing individuals in a population are regarded as elites. However, instead of being reproduced directly to the next generation, these elites are first enhanced. The group constituted by the elites is regarded as a swarm, and each elite corresponds to a particle within it. In this regard, the elites are enhanced by PSO, an operation which mimics the maturing phenomenon in nature. These enhanced elites constitute half of the population in the new generation, whereas the other half is generated by performing crossover and mutation operation on these enhanced elites. HGAPSO is applied to recurrent neural/fuzzy network design as follows. For recurrent neural network, a fully connected recurrent neural network is designed and applied to a temporal sequence production problem. For recurrent fuzzy network design, a Takagi-Sugeno-Kang-type recurrent fuzzy network is designed and applied to dynamic plant control. The performance of HGAPSO is compared to both GA and PSO in these recurrent networks design problems, demonstrating its superiority.'}
INFO:beir.datasets.data_loader_hf:Loading Queries...
INFO:beir.datasets.data_loader_hf:Loaded 1000 TEST Queries.
INFO:beir.datasets.data_loader_hf:Query Example: {'id': '78495383450e02c5fe817e408726134b3084905d', 'text': 'A Direct Search Method to solve Economic Dispatch Problem with Valve-Point Effect'}
INFO:faiss.loader:Loading faiss with AVX2 support.
INFO:faiss.loader:Successfully loaded faiss with AVX2 support.
INFO:beir.retrieval.search.dense.exact_search:Encoding Queries...
ERROR:mteb.evaluation.MTEB:Error while evaluating SCIDOCS: 'Dataset' object has no attribute 'keys'
Traceback (most recent call last):
  File "/home/santi/MSTEB/run_example.py", line 36, in <module>
    evaluation.run(model, output_folder=f"results/{model_name}", eval_splits=["test"])
  File "/home/santi/MSTEB/mteb/evaluation/MTEB.py", line 289, in run
    raise e
  File "/home/santi/MSTEB/mteb/evaluation/MTEB.py", line 271, in run
    results = task.evaluate(model, split, **kwargs)
  File "/home/santi/MSTEB/mteb/abstasks/AbsTaskRetrieval.py", line 82, in evaluate
    results = retriever.retrieve(corpus, queries)
  File "/opt/conda/lib/python3.10/site-packages/beir/retrieval/evaluation.py", line 20, in retrieve
    return self.retriever.search(corpus, queries, self.top_k, self.score_function, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/beir/retrieval/search/dense/exact_search.py", line 39, in search
    query_ids = list(queries.keys())
AttributeError: 'Dataset' object has no attribute 'keys'
```



I believe it might be a bug, because the evaluation works if I run the script without setting `USE_HF_DATASETS = True`. The error seems to come from BeIR's `EvaluateRetrieval`.
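From the traceback, BeIR's dense `exact_search` expects `queries` and `corpus` to be plain dicts keyed by id, while the `HFDataLoader` hands back `datasets.Dataset` objects. A minimal sketch of the kind of adapter that could bridge the two (the function names are just illustrative, and plain lists of records stand in for the `Dataset`; a real fix would live in the loader):

```python
# BeIR's exact_search calls queries.keys(), so it assumes:
#   queries = {qid: "query text"}
#   corpus  = {docid: {"title": ..., "text": ...}}
# The HF loader instead yields records like {'id': ..., 'title': ..., 'text': ...}.

def queries_to_dict(query_records):
    """Map an iterable of {'id', 'text'} records to BeIR's {qid: text} dict."""
    return {q["id"]: q["text"] for q in query_records}

def corpus_to_dict(doc_records):
    """Map an iterable of {'id', 'title', 'text'} records to BeIR's corpus dict."""
    return {d["id"]: {"title": d.get("title", ""), "text": d["text"]}
            for d in doc_records}

queries = queries_to_dict([{"id": "q1", "text": "valve-point effect dispatch"}])
corpus = corpus_to_dict([{"id": "d1", "title": "HGAPSO",
                          "text": "A hybrid of GA and PSO..."}])
# list(queries.keys()) now works the way exact_search expects.
```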

Thanks in advance!
SantiDianaClibrain commented 11 months ago

It also happens with some other datasets, such as ArguAna.

Muennighoff commented 11 months ago

Hmm, yeah the BEIR integration is not great. At this point, I think we should drop the BEIR dependency, copy over the BEIR datasets to the MTEB HF org and just download them from there. CQADupstack docs are not on the hub yet afaik, so that dataset needs to be newly created. We can also drop the multi-node/multi-gpu evaluation code because this can actually be done much easier on the user's side by just wrapping their model inside DDP or similar. This would simplify the code, reduce dependencies and make MTEB much more usable & extendable I think.
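To illustrate the user-side parallelism idea: instead of mteb shipping multi-node/multi-GPU code, each worker (e.g. one per GPU, launched via `torchrun`/DDP) could evaluate its own shard of the queries and the shards could be merged afterwards. A hedged sketch of the sharding part, with `shard` as a hypothetical helper that is not part of mteb or BeIR:

```python
# Sketch of user-side data parallelism: each worker owns a contiguous slice
# of the query ids, so the library itself needs no distributed code.

def shard(items, rank, world_size):
    """Return the contiguous slice of `items` owned by worker `rank`."""
    per_worker = (len(items) + world_size - 1) // world_size  # ceil division
    start = rank * per_worker
    return items[start:start + per_worker]

query_ids = [f"q{i}" for i in range(10)]
shards = [shard(query_ids, rank, world_size=4) for rank in range(4)]

# Every query lands on exactly one worker, and merging restores the full set.
merged = [qid for s in shards for qid in s]
assert merged == query_ids
```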

Unfortunately, I don't have time to work on this atm, but if you or someone else does (maybe @NouamaneTazi @loicmagne), this would be an amazing contribution.

KennethEnevoldsen commented 7 months ago

Related to #233. That PR should resolve this issue.