beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
http://beir.ai
Apache License 2.0

How was the hotpot_qa dataset preprocessed? #181

Open DanielSchuhmacher opened 1 week ago

DanielSchuhmacher commented 1 week ago

I am curious how you created the list of documents (the corpus). The original hotpot_qa does not come with such a list. Instead, each query comes with only 10 documents: 2 documents containing the content for the gold answer and 8 distractor documents. My current assumption is the following: you took the distractor dataset and extracted the documents for all queries to build the corpus. The 2 gold documents in the original hotpot_qa were then marked as the relevant documents for that query.
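To make that assumption concrete, here is a minimal Python sketch of what I imagine happened. It is only my guess, not the confirmed BEIR preprocessing. The record layout (`_id`, `question`, `context` as `[title, sentences]` pairs, `supporting_facts` as `[title, sentence_index]` pairs) is how I understand the original distractor JSON, and the function name `build_corpus_and_qrels`, the use of titles as document ids, and the toy records are made up for illustration.

```python
def build_corpus_and_qrels(examples):
    """Pool per-query passages into one corpus and mark gold passages as relevant.

    Hypothetical helper: this mirrors the assumption in this issue,
    not BEIR's actual preprocessing code.
    """
    corpus = {}   # title -> passage text, deduplicated across queries
    queries = {}  # query id -> question text
    qrels = {}    # query id -> {title: relevance label}
    for ex in examples:
        queries[ex["_id"]] = ex["question"]
        # Every passage attached to any query goes into the shared corpus.
        for title, sentences in ex["context"]:
            corpus.setdefault(title, "".join(sentences))
        # Passages named in supporting_facts are the gold documents.
        gold_titles = {title for title, _ in ex["supporting_facts"]}
        qrels[ex["_id"]] = {title: 1 for title in gold_titles}
    return corpus, queries, qrels


# Toy records in the assumed layout (invented content, 3 passages per query).
toy_examples = [
    {
        "_id": "q1",
        "question": "In which city is the museum founded by A?",
        "context": [
            ["A", ["A founded a museum. ", "It opened in 1900."]],
            ["Museum", ["The museum is in Springfield."]],
            ["B", ["B is an unrelated distractor."]],
        ],
        "supporting_facts": [["A", 0], ["Museum", 0]],
    },
    {
        "_id": "q2",
        "question": "Who directs the museum in Springfield?",
        "context": [
            ["Museum", ["The museum is in Springfield."]],
            ["C", ["C directs the museum."]],
            ["D", ["D is an unrelated distractor."]],
        ],
        "supporting_facts": [["Museum", 0], ["C", 0]],
    },
]

corpus, queries, qrels = build_corpus_and_qrels(toy_examples)
print(len(corpus), "documents in the pooled corpus")
print(qrels)
```

In this sketch a passage shared by several queries is stored only once, so the pooled corpus is smaller than 10 documents times the number of queries.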

Please let me know how it works, since this confuses me quite a lot. Thank you very much!

If my assumption is correct, you could also have a look at the multi-hop-rag dataset, which was created in that format from the start (the corpus is separated from the queries and answers). Its documents are also longer, which I think is a more realistic use case for a retrieval system, especially a RAG system.