Transitioning retrieval datasets to the retrieval format recommended in #1090

embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

https://arxiv.org/abs/2210.07316

Apache License 2.0


Transitioning retrieval datasets to the retrieval format recommended in #1090 #1282

Open KennethEnevoldsen opened 1 month ago

KennethEnevoldsen commented 1 month ago

Currently, we have to reimplement the load_data() function (e.g. here) whenever we add a retrieval dataset that uses the recommended format (#1090). We should change the default so that it corresponds to the recommended format.

For older datasets, we can update them on Hugging Face, add a custom load_data function, or both.
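To make the "custom load_data" option more concrete, here is a rough sketch of the conversion step such a function would need. It is only an illustration: the helper name `rows_to_retrieval_dicts` is hypothetical, and the column names (`_id`, `title`, `text`, `query-id`, `corpus-id`, `score`) are assumed BEIR-style names, not necessarily the exact layout agreed on in #1090.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping, Tuple

Corpus = Dict[str, Dict[str, str]]   # doc id -> {"title": ..., "text": ...}
Queries = Dict[str, str]             # query id -> query text
Qrels = Dict[str, Dict[str, int]]    # query id -> {doc id: relevance score}


def rows_to_retrieval_dicts(
    corpus_rows: Iterable[Mapping[str, Any]],
    query_rows: Iterable[Mapping[str, Any]],
    qrels_rows: Iterable[Mapping[str, Any]],
) -> Tuple[Corpus, Queries, Qrels]:
    """Convert row-wise records into nested dicts for one split.

    Hypothetical helper; the column names are assumptions (BEIR-style).
    """
    corpus: Corpus = {
        # A missing or None title falls back to an empty string.
        str(row["_id"]): {"title": row.get("title") or "", "text": row["text"]}
        for row in corpus_rows
    }
    queries: Queries = {str(row["_id"]): row["text"] for row in query_rows}

    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for row in qrels_rows:
        qrels[str(row["query-id"])][str(row["corpus-id"])] = int(row["score"])

    return corpus, queries, dict(qrels)


# Toy rows standing in for what iterating over a loaded dataset split yields.
toy_corpus = [
    {"_id": "d1", "title": "MTEB", "text": "A text embedding benchmark."},
    {"_id": "d2", "title": None, "text": "An unrelated document."},
]
toy_queries = [{"_id": "q1", "text": "What is MTEB?"}]
toy_qrels = [{"query-id": "q1", "corpus-id": "d1", "score": 1}]

corpus, queries, qrels = rows_to_retrieval_dicts(toy_corpus, toy_queries, toy_qrels)
```

Inside an actual load_data, the three row iterables would presumably come from datasets.load_dataset(..., trust_remote_code=False) calls, and the results would be stored per split (e.g. self.corpus = {split: corpus}). Those details depend on how the dataset is laid out on the Hub, so treat this as a starting point rather than the agreed implementation.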

KennethEnevoldsen commented 1 month ago

Adding a bit more here: The current default format for the retrieval task (assumed by the load_data function) does not allow us to load the dataset using datasets.load_dataset(..., trust_remote_code=False), which introduces a safety concern and causes bugs (e.g. #1363).

In #1090 we introduced an alternative format that is more efficient and can be loaded using trust_remote_code=False.

(See #1308 for an example of a dataset PR that originally used the old format and then moved to the new one.)