DuReader focuses on benchmarks and models for machine reading comprehension (MRC) in question answering.
Dataset:
DuReader-vis
: The first Chinese Open-domain Document Visual Question Answering (Open-Domain DocVQA) dataset. [Paper]
DuReader Retrieval
: A large-scale Chinese dataset for passage retrieval. [Paper][Code] [Leaderboard]
DuQM
: Linguistically Perturbed Natural Questions for Evaluating the Robustness of Question Matching Models. [Paper] [Code] [Leaderboard]
DuReader Checklist
: A dataset challenging models' understanding of vocabulary, phrases, semantic roles, and reasoning. [Code] [Leaderboard]
DuReader Yes/No
: A dataset challenging models in opinion polarity judgment. [Code] [Leaderboard]
DuReader Robust
: A dataset challenging models in (1) over-sensitivity, (2) over-stability and (3) generalization. [Paper] [Code] [Leaderboard]
DuReader 2.0
: A new large-scale, real-world, human-sourced MRC dataset. [Paper] [Code] [Leaderboard]
DuReader Robust, DuReader Yes/No, DuReader Checklist, and DuQM can be downloaded from the qianyan official website. DuReader-vis can be downloaded by following the instructions in DuReader-vis/README.md in this repository. DuReader 2.0 can be downloaded by following the instructions in DuReader-2.0/README.md in this repository.
Models:
KT-NET
: A machine reading comprehension (MRC) model that integrates knowledge from knowledge bases (KBs) into pre-trained contextualized representations. [Paper] [Code] [Leaderboard]
D-NET
: A simple pre-training and fine-tuning framework that focuses on the generalization of machine reading comprehension (MRC) models. [Paper] [Code] [Leaderboard]
DuReader contains five datasets: DuReader 2.0, DuReader Robust, DuReader Yes/No, DuReader Checklist, and DuReader-vis. The main features of these datasets include:
DuReader is a new large-scale, real-world, human-sourced MRC dataset in Chinese. DuReader focuses on real-world open-domain question answering. Its advantages over existing datasets can be summarized as follows: real questions, real articles, real answers, real application scenarios, and rich annotation.
KT-NET (Knowledge and Text fusion NET) is a machine reading comprehension (MRC) model that integrates knowledge from knowledge bases (KBs) into pre-trained contextualized representations. The model was proposed in the ACL 2019 paper Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension.
D-NET is a simple system that Baidu submitted to the MRQA (Machine Reading for Question Answering) 2019 Shared Task, which focused on the generalization of machine reading comprehension (MRC) models. The system is built on a pre-training and fine-tuning framework, and it explores pre-trained language models and multi-task learning to improve the generalization of MRC models. D-NET ranked first among all participants in terms of average F1 score.
DuReader Robust is designed to challenge MRC models in the following aspects: (1) over-sensitivity, (2) over-stability and (3) generalization. In addition, DuReader Robust has another advantage over previous datasets: its questions and documents come from Baidu Search, so it exposes the robustness issues that arise when MRC models are applied to real-world scenarios.
Span-based MRC tasks adopt the F1 and EM (exact match) metrics to measure the difference between predicted answers and labeled answers. However, tasks about opinion polarity cannot be measured well by these metrics. DuReader Yes/No is proposed to challenge MRC models in opinion polarity judgment; it complements existing MRC tasks and allows the effectiveness of existing models to be evaluated more reasonably.
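To make the difference between the metrics concrete, here is a minimal sketch of EM and F1 for span answers, next to a plain accuracy score for polarity labels. It is not the official DuReader evaluation script: the function names, the character-level tokenization (a common choice for unsegmented Chinese text), the use of accuracy for polarity, and the label strings are all assumptions made for illustration.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the two answers are identical after stripping, else 0.0."""
    return float(prediction.strip() == reference.strip())


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 between a predicted and a labeled answer.

    Assumption: characters are used as tokens; official scripts may
    normalize or segment the text differently.
    """
    pred_chars = [c for c in prediction if not c.isspace()]
    ref_chars = [c for c in reference if not c.isspace()]
    if not pred_chars or not ref_chars:
        # Two empty answers match; one empty answer is a miss.
        return float(pred_chars == ref_chars)
    # Multiset intersection counts each shared character at most
    # as often as it appears in both answers.
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


def polarity_accuracy(predictions, labels) -> float:
    """Fraction of polarity labels predicted correctly (hypothetical metric choice)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)


if __name__ == "__main__":
    print(exact_match("北京市", "北京"))
    print(char_f1("北京市", "北京"))
    # The label strings below are illustrative only.
    print(polarity_accuracy(["Yes", "No", "Depends"], ["Yes", "No", "No"]))
```

Overlap-based scores reward partial span matches, which is meaningless when the target is a single polarity label, so a classification-style score is the more natural fit in that setting.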
DuReader Checklist is a high-quality Chinese machine reading comprehension dataset for real application scenarios. It is designed to challenge natural language understanding capabilities from multiple aspects via systematic evaluation (i.e., a checklist), including the understanding of vocabulary, phrases, semantic roles, reasoning, and so on.
DuQM is a Chinese robustness dataset for question matching. It contains natural questions with linguistic perturbations to evaluate the robustness of question matching models. DuQM is designed to be fine-grained, diverse and natural, and it contains 3 categories and 13 subcategories with 32 linguistic perturbations.
DuReader Retrieval is a large-scale Chinese dataset for passage retrieval, built from a web search engine. The dataset contains more than 90K queries and over 8M unique passages from realistic data sources.
DuReader-vis is the first Chinese Open-domain DocVQA dataset, built from a web search engine. The dataset contains more than 15K labeled question-document pairs and over 158K unique documents from realistic data sources.
We have released a dataset loading and evaluation tool named qianyan. To use this package, follow the instructions in the qianyan repo.
Copyright 2017 Baidu.com, Inc. All Rights Reserved
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
For help or issues using DuReader, including datasets and baselines, please submit a GitHub issue.
For other communication or cooperation, please contact Jing Liu (liujing46@baidu.com) or Hongyu Li (lihongyu04@baidu.com).