How was the hotpot_qa dataset preprocessed?

I am curious how you created the list of documents (the corpus). The original hotpot_qa does not come with such a list. Instead, each query comes with only 10 documents: 2 documents containing the content for the gold answer and 8 distractor documents. My current assumption is the following: you took the distractor dataset and extracted the documents for all queries to build the corpus. The 2 gold documents from the original hotpot_qa were then marked as the relevant documents for each query.
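To make my assumption concrete, here is a minimal sketch of what I imagine the preprocessing looks like. It is only my guess, not your actual code. I assume the original distractor JSON layout (`_id`, `question`, `context` as `[title, sentences]` pairs, `supporting_facts` as `[title, sentence_index]` pairs) and that the document title is used as the document id. The function name `build_corpus_and_qrels` and the toy examples are made up for illustration.

```python
from typing import Dict, List, Tuple


def build_corpus_and_qrels(
    examples: List[dict],
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, Dict[str, int]]]:
    """Hypothetical reconstruction of the preprocessing (an assumption)."""
    corpus: Dict[str, str] = {}             # doc id (title) -> document text
    queries: Dict[str, str] = {}            # query id -> question
    qrels: Dict[str, Dict[str, int]] = {}   # query id -> {doc id: relevance}

    for ex in examples:
        queries[ex["_id"]] = ex["question"]

        # Pool all 10 context documents of every query into one shared corpus.
        # A title that appears for several queries is stored only once.
        for title, sentences in ex["context"]:
            corpus.setdefault(title, "".join(sentences))

        # The documents named in supporting_facts are the gold documents.
        gold_titles = {title for title, _sent_idx in ex["supporting_facts"]}
        qrels[ex["_id"]] = {title: 1 for title in gold_titles}

    return corpus, queries, qrels


# Toy data in the assumed layout (shortened to a few documents per query).
examples = [
    {
        "_id": "q1",
        "question": "Q1?",
        "context": [["A", ["a1.", " a2."]], ["B", ["b1."]], ["C", ["c1."]]],
        "supporting_facts": [["A", 0], ["B", 0]],
    },
    {
        "_id": "q2",
        "question": "Q2?",
        "context": [["B", ["b1."]], ["D", ["d1."]]],
        "supporting_facts": [["B", 0], ["D", 0]],
    },
]

corpus, queries, qrels = build_corpus_and_qrels(examples)
```

In this sketch, document "B" is shared by both queries but ends up in the corpus only once, and the distractors "C" and "D" (for q1) are never marked as relevant. Is this roughly what happens?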

Please let me know how it works, since this confuses me quite a lot. Thank you very much!

If my assumption is correct, you could also have a look at the multi-hop-rag dataset, which was created in that format from the start (the corpus is separated from the queries and answers). Its documents are also longer, which I think is a more realistic use case for a retrieval system, especially a RAG system.

beir-cellar/beir, issue #181