NAACL-2019-ComQA: A Community-sourced Dataset for Complex Factoid Question Answering with Paraphrase Clusters

Summary:

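Another dataset paper. Unlike #310, which builds its data from Freebase, this dataset is built from the WikiAnswers community QA platform. Questions are grouped into clusters by paraphrase. The related-work section introduces the two main current approaches to the factoid QA task: QA over textual corpora and QA over KBs.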

Resource:

pdf
dataset
paper-with-code

Paper information:

Author:
Dataset:
keywords:

Notes:

According to Wikipedia, WikiAnswers was merged into Answers.com, so the WikiAnswers site can no longer be found directly.

Related Work

The related-work section introduces the two main current approaches to the factoid QA task: QA over textual corpora and QA over KBs.

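QA over textual corpora (2000-2015). The answer is found mainly in textual sources. The main benchmarks here are TREC and CLEF.

More recently (2015-2017), reading comprehension (RC) has been brought into this area. The goal is to answer a question from a given textual paragraph. This differs somewhat from factoid QA, where the answer must be found in a large collection of documents rather than in a single paragraph.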

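QA over KBs (2015-2018). The question is translated into a structured query through semantic parsing (there is a lot of work on this). Many datasets have appeared in this area over the past five years. A table in the paper compares the datasets along several dimensions, and the paper then argues that ComQA is better than the others.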
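To make "translate the question into a structured query" concrete, here is a minimal sketch. It is not the method of ComQA or of any system cited in the paper: the toy triple store `KB`, the predicate names, and the regex patterns are all assumptions made up for illustration.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical toy knowledge base of (subject, predicate, object) triples.
KB = [
    ("Albert Einstein", "place_of_birth", "Ulm"),
    ("Albert Einstein", "date_of_birth", "1879-03-14"),
]

# Hypothetical surface patterns, each mapped to a KB predicate.
PATTERNS = [
    (re.compile(r"^where was (?P<entity>.+) born\?$", re.I), "place_of_birth"),
    (re.compile(r"^when was (?P<entity>.+) born\?$", re.I), "date_of_birth"),
]


def parse(question: str) -> Optional[Tuple[str, str]]:
    """Map a question to a structured query (entity, predicate), or None."""
    text = question.strip()
    for pattern, predicate in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("entity"), predicate
    return None


def answer(question: str) -> List[str]:
    """Run the structured query against the toy KB and collect the objects."""
    query = parse(question)
    if query is None:
        return []
    entity, predicate = query
    return [o for s, p, o in KB if s.lower() == entity.lower() and p == predicate]
```

For example, `answer("Where was Albert Einstein born?")` looks up the `place_of_birth` triple. A real semantic parser replaces the hand-written patterns with a learned mapping and targets a query language such as SPARQL.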

3 Overview

Definition:

A factoid question is a question whose answer is one or a small number of entities or literal values (Voorhees and Tice, 2000).
e.g., “Who were the secretaries of state under Barack Obama?” and “When was Germany’s first postwar chancellor born?”.

3.1 Questions in ComQA

The paper sorts questions into several types.

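Simple: asks about a single property of one entity, in 5W1H form (e.g., “Where was Einstein born?”).
Compositional: the entities in the question may be related to each other or nested. Intersection: the two sub-questions can be answered independently, but they are related (e.g., “Which films featuring Tom Hanks did Spielberg direct?”). Nested: answering the question requires first answering a sub-question (e.g., “Who were the parents of the thirteenth president of the US?”).
Temporal: requires reasoning about time. The temporal expression can be explicit (e.g., ‘in 1998’), implicit (e.g., ‘during the WWI’), relative (e.g., ‘current’), or latent (e.g., ‘Who is the US president?’). This type also covers When-questions whose answer is an explicit temporal expression (e.g., “When did Trenton become New Jersey’s capital?”).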
Comparison （）
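As a rough illustration of the temporal subtypes listed above, the sketch below tags a question with surface cues. It is a crude keyword heuristic written for these notes, not how ComQA was annotated; the cue word lists and the label `temporal-answer` are assumptions. Latent cases such as “Who is the US president?” have no surface marker, so this approach cannot detect them.

```python
import re
from typing import List

# Assumed cue patterns; the word lists are illustrative, not from the paper.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")                  # explicit: 'in 1998'
IMPLICIT = re.compile(r"\b(during|before|after|since)\b", re.I)   # implicit: 'during the WWI'
RELATIVE = re.compile(r"\b(current|currently|latest|previous|next)\b", re.I)


def temporal_cues(question: str) -> List[str]:
    """Return the temporal cue labels found in the question (possibly empty)."""
    cues = []
    if YEAR.search(question):
        cues.append("explicit")
    if IMPLICIT.search(question):
        cues.append("implicit")
    if RELATIVE.search(question):
        cues.append("relative")
    # When-questions are expected to have a temporal expression as the answer.
    if question.strip().lower().startswith("when "):
        cues.append("temporal-answer")
    return cues
```

An empty list means no surface cue was found; it does not mean the question is non-temporal.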

Model Graph:

Result:

Thoughts:

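As expected, in an open-domain setting the number of question types becomes very large and there are many more variables. If the domain is fixed in advance, the common questions for that domain can be anticipated, for example “What is the operating revenue of company XXX?” in the company-information domain. This not only narrows the range of questions but also improves the accuracy of recognizing them.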
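The fixed-domain idea can be sketched as a small set of anticipated question templates. Everything here is hypothetical: the attribute names and the templates are invented for a company-information domain, and the CEO template is an extra example not taken from the notes above.

```python
import re
from typing import Optional, Tuple

# Hypothetical templates for a company-information domain.
# Each template maps a question form to an attribute name.
TEMPLATES = [
    (re.compile(r"^what is the operating revenue of (?P<company>.+?)\??$", re.I),
     "operating_revenue"),
    (re.compile(r"^who is the ceo of (?P<company>.+?)\??$", re.I),
     "ceo"),
]


def parse_company_question(question: str) -> Optional[Tuple[str, str]]:
    """Return (company, attribute) if the question fits a known template, else None."""
    text = question.strip()
    for pattern, attribute in TEMPLATES:
        match = pattern.match(text)
        if match:
            return match.group("company").strip(), attribute
    return None
```

Questions outside the anticipated templates return `None`, which is the trade-off of a closed domain: narrower coverage in exchange for more reliable recognition.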

Next Reading:
