[C-MTEB] How to convert QA dataset to Retrieval & Reranking Dataset

FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

MIT License

[C-MTEB] How to convert QA dataset to Retrieval & Reranking Dataset #500

Open Iambestfeed opened 5 months ago

Iambestfeed commented 5 months ago

I noticed that some datasets, such as CmedqaRetrieval, CMedQAv1, and CMedQAv2, are built from QA datasets and converted to 'query-pos-neg' format. Do you have any instructions for building this kind of data?

QA dataset sample:

Instruct:
Output:

Reranking dataset sample:

Query:
Pos:
Neg:

Retrieval dataset sample:

Query:
Context:
Id:

staoxiao commented 5 months ago

For QA datasets, we use query as query, and use answer/context as pos. We use the candidate (except ground truth) provided by the original dataset as neg.
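A minimal sketch of that mapping, in case it helps. The input schema (`question`, `answers`, `candidates`) and the function name `qa_to_reranking` are hypothetical, chosen only for illustration; they are not the format of any specific C-MTEB dataset.

```python
import json


def qa_to_reranking(record):
    """Convert one QA record into a query/pos/neg dict.

    Assumed (hypothetical) input schema:
      {"question": str, "answers": [str, ...], "candidates": [str, ...]}
    """
    # Deduplicate while preserving the original order.
    positives = list(dict.fromkeys(record["answers"]))
    pos_set = set(positives)
    # Every candidate that is not a ground-truth answer becomes a negative.
    negatives = [c for c in dict.fromkeys(record["candidates"]) if c not in pos_set]
    return {"query": record["question"], "pos": positives, "neg": negatives}


sample = {
    "question": "What helps a mild headache?",
    "answers": ["Rest and drink water."],
    "candidates": [
        "Rest and drink water.",
        "Apply ice to a sprained ankle.",
        "Antibiotics treat bacterial infections.",
    ],
}

row = qa_to_reranking(sample)
print(json.dumps(row, ensure_ascii=False))
```

Writing one such dict per line gives a JSON Lines file with one example per query.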

If there are no candidates for your datasets, you can find some candidates via an embedding model to construct neg.
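As a rough sketch of that idea: score every passage in the corpus against the query, drop the known positives, and keep the top-scoring remainder as negatives. `toy_embed` below is only a bag-of-words stand-in so the example runs without dependencies; in practice you would pass a real embedding model's encode function as `embed`. All names here are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: sparse bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_negatives(query, positives, corpus, embed=toy_embed, top_k=2):
    """Return the top_k passages most similar to the query, excluding positives."""
    q_vec = embed(query)
    pos_set = set(positives)
    scored = [(cosine(q_vec, embed(doc)), doc) for doc in corpus if doc not in pos_set]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


corpus = [
    "aspirin relieves mild headache pain",
    "rest and fluids help a cold",
    "the stock market fell today",
    "dehydration causes headache",
]
query = "what helps a headache"
positives = ["aspirin relieves mild headache pain"]

negs = mine_negatives(query, positives, corpus)
print(negs)
```

Note that the highest-ranked passages may include correct answers that were never labeled, so it can be worth inspecting a sample of the mined negatives.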

Iambestfeed commented 5 months ago

> For QA datasets, we use query as `query`, and use answer/context as `pos`. We use the candidate (except ground truth) provided by the original dataset as `neg`.
>
> If there are no candidates for your datasets, you can find some candidates via an embedding model to construct `neg`.

Thanks for the answer. One more question: is there a way to filter out complex questions (tricky questions with subtext, whose answers are usually not directly related to the question)?

staoxiao commented 5 months ago

One possible method is to use GPT to filter these questions. Using the cosine similarity between questions and answers is simpler, but the threshold is difficult to set.
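A small sketch of the cosine-similarity variant: embed each question and its answer, then drop pairs whose similarity falls below a threshold. The threshold value `0.3` is arbitrary and would need tuning on your own data, for example by inspecting the scores of pairs you already know are indirect. `toy_embed` is again a bag-of-words stand-in for a real embedding model, and every name is hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: sparse bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_by_similarity(pairs, embed=toy_embed, threshold=0.3):
    """Split (question, answer) pairs into kept and dropped by cosine score."""
    kept, dropped = [], []
    for question, answer in pairs:
        score = cosine(embed(question), embed(answer))
        (kept if score >= threshold else dropped).append((question, answer, score))
    return kept, dropped


pairs = [
    ("what causes headache", "dehydration causes headache"),
    ("is it going to rain", "bring an umbrella"),
]

kept, dropped = filter_by_similarity(pairs)
print("kept:", [q for q, _, _ in kept])
print("dropped:", [q for q, _, _ in dropped])
```

The second pair is a reasonable answer with no surface overlap, which is exactly the kind of indirect question this filter removes. With a real embedding model the scores would reflect meaning rather than shared words, but the threshold would still need tuning.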