The left box of Figure 1 shows a real example from a financial report: a table with row and column headers and numeric cells, together with several paragraphs describing it. In QA problems, we call data like this a hybrid context, because it contains both tabular and textual content, and we call the paragraphs the associated paragraphs of the table.
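Data construction: 500 annual reports from the past two years were collected. Tables were detected with the table detection model of (Li et al., 2019), and Apache PDFBox was then used to extract the table contents. Only tables with 3 to 30 rows and 3 to 6 columns were extracted. This yielded 20,000 tables in total, none of which follows a standard format. The tables may also contain errors, such as too few rows or columns, or missing numbers. During annotation, such tables are picked out manually and either deleted or corrected.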
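The row and column limits above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Table` representation (a list of rows, each a list of cell strings) and the `keep_table` helper are hypothetical, and whether header rows count toward the limits is not stated in the notes.

```python
from typing import List

# Hypothetical representation: a table is a list of rows, each a list of cell strings.
Table = List[List[str]]

MIN_ROWS, MAX_ROWS = 3, 30  # row limits quoted in the notes
MIN_COLS, MAX_COLS = 3, 6   # column limits quoted in the notes


def keep_table(table: Table) -> bool:
    """Return True if the table falls within the row and column limits."""
    n_rows = len(table)
    if not MIN_ROWS <= n_rows <= MAX_ROWS:
        return False
    # Assumption: the widest row defines the column count.
    n_cols = max(len(row) for row in table)
    return MIN_COLS <= n_cols <= MAX_COLS
```

Tables that pass this size check may still contain missing numbers, which is why the manual clean-up during annotation is still needed.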
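QA pair creation. Annotators are asked to write mainly questions that do not require advanced financial knowledge. For each hybrid context, they create at least 6 questions, covering both extracted and calculated questions. For an extracted question, the answer can be a single span or multiple spans from the table or the paragraphs. For a calculated question, the answer requires some numerical reasoning, such as addition, subtraction, multiplication, division, comparison, or sorting. Where necessary, annotators also label the right scale for the numerical answer.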
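A minimal sketch of what a calculated question involves, assuming two table cells and a scale label. The numbers, the `"million"` scale value, and the helper names `change` and `percentage_change` are made up for illustration and are not taken from the dataset.

```python
# Made-up table cells for illustration; values are assumed to be reported in millions.
revenue = {"2019": 120.0, "2018": 100.0}
scale = "million"  # hypothetical scale label attached to the numerical answer


def change(new: float, old: float) -> float:
    """Absolute change between two cells (subtraction)."""
    return new - old


def percentage_change(new: float, old: float) -> float:
    """Relative change in percent (subtraction, multiplication, division)."""
    return (new - old) * 100 / old


diff = change(revenue["2019"], revenue["2018"])
pct = percentage_change(revenue["2019"], revenue["2018"])
print(f"Change in revenue: {diff} {scale}")
print(f"Percentage change: {pct} percent")
```

The scale matters because the same number read from a cell can mean different quantities depending on whether the table reports units, thousands, or millions.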
Answer Type and Derivation Annotation. An answer falls into one of three types: a single span or multiple spans extracted from the table or text, or a generated answer (usually obtained through numerical reasoning). Annotators must label which type applies. For a generated answer, they also record its derivation, which makes it easier to develop QA models.
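One way to picture the resulting annotation is a small record type. This is a sketch only: the class name `QAAnnotation`, the field names, and the type labels are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels for the three answer types described in the notes.
ANSWER_TYPES = {"single_span", "multi_span", "generated"}


@dataclass
class QAAnnotation:
    question: str
    answer_type: str                  # one of ANSWER_TYPES
    answer: List[str]                 # one or more spans, or the generated value
    scale: str = ""                   # optional scale for numerical answers
    derivation: Optional[str] = None  # how a generated answer was obtained

    def __post_init__(self) -> None:
        if self.answer_type not in ANSWER_TYPES:
            raise ValueError(f"unknown answer type: {self.answer_type}")
        if self.answer_type == "single_span" and len(self.answer) != 1:
            raise ValueError("a single-span answer must contain exactly one span")
        if self.answer_type == "generated" and not self.derivation:
            raise ValueError("a generated answer needs a derivation")


# Illustrative record with made-up content.
example = QAAnnotation(
    question="What is the change in revenue from 2018 to 2019?",
    answer_type="generated",
    answer=["20"],
    scale="million",
    derivation="120 - 100",
)
```

Validating in `__post_init__` keeps inconsistent records, such as a generated answer without a derivation, from being created at all.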
Summary:
The paper extracts tables and text from annual reports to build a QA dataset, and proposes a new QA model that can reason over both tables and text.
Resource:
Paper information:
Notes:
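The so-called hybrid context centers on a table and the descriptive sentences below it. Answering a question requires reasoning over the numbers in the table with the help of those descriptions.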
Annotation stage
2.3 Quality Control TODO
Model Graph:
Result:
Thoughts:
Next Reading: