embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Solution for transforming retrieval datasets into parquet #1090

Open gowitheflow-1998 opened 3 months ago

gowitheflow-1998 commented 3 months ago

I believe we ran into this trust_remote_code issue a while ago when we wanted to convert datasets to parquet, and the retrieval datasets weren't compatible. I just confirmed with @KennethEnevoldsen that this hasn't been solved.

I happened to find a solution here, where they convert the corpus, queries, and qrels into separate parquet files. Each part can then be loaded on its own with load_dataset(dataset_name, "qrels"), load_dataset(dataset_name, "query"), and load_dataset(dataset_name, "corpus").
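As a rough illustration of the three-config layout described above, here is a minimal sketch. The config names ("corpus", "query", "qrels") come from the calls above; everything else is an assumption for illustration: the column names (`_id`, `title`, `text`, `query-id`, `corpus-id`, `score`), the split names, and the helper names `build_retrieval_dicts` and `load_from_hub` are hypothetical and may not match a given dataset repo.

```python
from collections import defaultdict


def build_retrieval_dicts(corpus_rows, query_rows, qrels_rows):
    """Turn three flat tables into nested dicts keyed by id.

    Column names are assumed, not taken from any specific dataset.
    """
    corpus = {
        row["_id"]: {"title": row.get("title", ""), "text": row["text"]}
        for row in corpus_rows
    }
    queries = {row["_id"]: row["text"] for row in query_rows}
    qrels = defaultdict(dict)
    for row in qrels_rows:
        # One relevance judgment per (query, document) pair.
        qrels[row["query-id"]][row["corpus-id"]] = int(row["score"])
    return corpus, queries, dict(qrels)


def load_from_hub(dataset_name, corpus_split, query_split, qrels_split):
    """Hypothetical helper: load each config separately from the Hub.

    Needs the `datasets` package and network access; the split names
    depend on how the dataset repo was uploaded, so they are parameters.
    """
    from datasets import load_dataset  # lazy import; only needed here

    corpus_rows = load_dataset(dataset_name, "corpus", split=corpus_split)
    query_rows = load_dataset(dataset_name, "query", split=query_split)
    qrels_rows = load_dataset(dataset_name, "qrels", split=qrels_split)
    return build_retrieval_dicts(corpus_rows, query_rows, qrels_rows)


# Small in-memory example, so the sketch runs without any download.
corpus, queries, qrels = build_retrieval_dicts(
    corpus_rows=[
        {"_id": "d1", "title": "", "text": "a photo of a cat"},
        {"_id": "d2", "text": "a photo of a dog"},
    ],
    query_rows=[{"_id": "q1", "text": "cat"}],
    qrels_rows=[{"query-id": "q1", "corpus-id": "d1", "score": 1}],
)
```

Because each part is a plain parquet-backed config, none of these calls should need a custom loading script, which is the point of the approach.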

I had a go at implementing i2t (image-to-text) retrieval using this format here, and it works smoothly. I will follow this approach when creating more image-text retrieval datasets, and maybe we can handle the main branch the same way!

KennethEnevoldsen commented 3 months ago

> I believe we had this trust_remote_code issue a while ago when we wanted to turn files into parquet, and retrieval datasets weren't compatible. Just confirmed with @KennethEnevoldsen this hasn't been solved.

It is currently handled by setting trust_remote_code=True where required, but future datasets should not rely on this (the tests will fail). It would be great if someone fixed the older datasets as well, but that is not strictly required.