PaddlePaddle / RocketQA

🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.
Apache License 2.0

Dureader #15

Closed: yclzju closed this issue 2 years ago

yclzju commented 2 years ago

Hi, great job! I see that you have released a Chinese retrieval model trained on DuReader. Could you please also release your preprocessing code or the processed datasets?

LegendaryDan commented 2 years ago

We are working on cleaning a better version of the data, and will release the data soon. Stay tuned!

Thanks.

yclzju commented 2 years ago

Great! In the meantime, will you first release your data preprocessing code, including the code for MS MARCO and DuReader?

LegendaryDan commented 2 years ago

@yclzju We have released DuReader_retrieval, a large-scale Chinese benchmark for passage retrieval. The dataset contains over 90K questions and 8M passages from Baidu Search.

paper: https://arxiv.org/abs/2203.10232
data: https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval
baseline: https://github.com/PaddlePaddle/RocketQA/tree/main/research/DuReader-Retrieval-Baseline