Solution for transforming retrieval datasets into parquet

embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Apache License 2.0

I believe we ran into this trust_remote_code issue a while ago when we wanted to convert dataset files to parquet and found that retrieval datasets weren't compatible. I just confirmed with @KennethEnevoldsen that this hasn't been solved.

I happened to find a solution here, where they convert the corpus, queries, and qrels into separate parquet files. Each part can then be loaded as its own config: load_dataset(dataset_name, "qrels"), load_dataset(dataset_name, "query"), load_dataset(dataset_name, "corpus").
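As a rough sketch of how the three configs could be combined after loading, something like the following might work. The column names (_id, title, text, query-id, corpus-id, score), the split arguments, and both helper names are assumptions for illustration (BEIR-style conventions), not something confirmed by the linked dataset or by mteb.

```python
from collections import defaultdict


def build_retrieval_dicts(corpus_rows, query_rows, qrels_rows):
    """Convert row-wise records into nested dicts keyed by id.

    Assumes BEIR-style columns (hypothetical for any given dataset):
    corpus/query rows have "_id" and "text", qrels rows have
    "query-id", "corpus-id" and "score".
    """
    corpus = {
        row["_id"]: {"title": row.get("title", ""), "text": row["text"]}
        for row in corpus_rows
    }
    queries = {row["_id"]: row["text"] for row in query_rows}
    qrels = defaultdict(dict)
    for row in qrels_rows:
        qrels[row["query-id"]][row["corpus-id"]] = int(row["score"])
    return corpus, queries, dict(qrels)


def load_retrieval_dataset(dataset_name, corpus_split, query_split, qrels_split):
    """Load the three parquet-backed configs and combine them.

    The split names are passed in because they differ between datasets.
    """
    # Imported lazily so build_retrieval_dicts works without `datasets` installed.
    from datasets import load_dataset

    corpus_rows = load_dataset(dataset_name, "corpus", split=corpus_split)
    query_rows = load_dataset(dataset_name, "query", split=query_split)
    qrels_rows = load_dataset(dataset_name, "qrels", split=qrels_split)
    return build_retrieval_dicts(corpus_rows, query_rows, qrels_rows)
```

Since every config is a plain parquet-backed table, none of these calls should need a custom loading script, which is the point of the separate-config layout.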

I had a go at implementing i2t (image-to-text) retrieval using this format here, and it works smoothly. I will follow this approach when creating more image-text retrieval datasets, and maybe we can handle it the same way on the main branch!

Solution for transforming retrieval datasets into parquet #1090