BrambleXu / knowledge-graph-learning


ACL-2021-TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance #362

Open BrambleXu opened 1 year ago


Summary:

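Tables and text are extracted from annual reports to build a QA dataset. The paper also proposes a new QA model that can reason across both the tables and the text.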

Resource:

Paper information:

Notes:


The left box of Figure 1 shows a real example from some financial report, where there is a table containing row/column header and numbers inside, and also some paragraphs describing it. We call the hybrid data like this example hybrid context in QA problems, as it contains both tabular and textual content, and call the paragraphs associated paragraphs to the table.

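The so-called hybrid context is about a table together with the descriptive sentences below it. Answering a question requires reasoning over the numbers in the table with the help of those descriptions.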
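To make the idea of a hybrid context concrete, here is a minimal sketch of how one table plus its associated paragraphs could be represented. The class name `HybridContext`, the helper `cell`, and all values in the example are hypothetical and made up for illustration; none of this is the TAT-QA data format or the paper's model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HybridContext:
    """One table plus the paragraphs that describe it (hypothetical layout)."""
    # Assumption: first row holds column headers, first column holds row headers.
    table: List[List[str]]
    paragraphs: List[str] = field(default_factory=list)


def cell(ctx: HybridContext, row_header: str, col_header: str) -> float:
    """Look up a numeric cell by its row and column header."""
    col = ctx.table[0].index(col_header)
    for row in ctx.table[1:]:
        if row[0] == row_header:
            return float(row[col])
    raise KeyError(row_header)


# Invented example values, not taken from any real report.
ctx = HybridContext(
    table=[
        ["", "2019", "2018"],
        ["Revenue", "120", "100"],
        ["Cost", "70", "65"],
    ],
    paragraphs=["All amounts are in thousands of dollars."],
)

# "How much did revenue change from 2018 to 2019?"
# The numbers come from the table; the unit comes from the paragraph.
change = cell(ctx, "Revenue", "2019") - cell(ctx, "Revenue", "2018")
```

The point of the sketch is that neither part suffices alone: the table supplies the operands, while the paragraph supplies the scale needed to state the answer correctly.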

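For data construction, 500 reports from the past two years were collected from Annual reports. The table detection model of (Li et al., 2019) was used to locate tables, and Apache PDFBox was then used to extract their contents. Only tables with 3 to 30 rows and 3 to 6 columns were extracted. This yielded 20,000 tables in total, none of which follow a standard format. The tables may also contain errors, such as too few rows or columns, or missing numbers. During annotation, such tables are picked out manually and either deleted or corrected.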
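The size constraint described above can be sketched as a simple filter. This is only an illustration under stated assumptions: tables are assumed to be already extracted into lists of rows, a missing cell is assumed to appear as `None` or as a short row, and the function names are hypothetical. In the paper the faulty tables are handled by hand during annotation, so the "flagged" list here only stands in for tables that would be routed to manual review.

```python
from typing import List, Optional, Tuple

# Assumption: a table is a list of rows; None marks a cell that failed to extract.
Table = List[List[Optional[str]]]

MIN_ROWS, MAX_ROWS = 3, 30  # row limits stated in the notes
MIN_COLS, MAX_COLS = 3, 6   # column limits stated in the notes


def within_size_limits(table: Table) -> bool:
    """Keep only tables with 3 to 30 rows and 3 to 6 columns."""
    if not table:
        return False
    n_rows = len(table)
    n_cols = max(len(row) for row in table)
    return MIN_ROWS <= n_rows <= MAX_ROWS and MIN_COLS <= n_cols <= MAX_COLS


def has_missing_cells(table: Table) -> bool:
    """Heuristic (an assumption): a ragged row or a None cell counts as missing."""
    n_cols = max(len(row) for row in table)
    return any(
        len(row) < n_cols or any(c is None for c in row)
        for row in table
    )


def triage(tables: List[Table]) -> Tuple[List[Table], List[Table]]:
    """Split size-compliant tables into clean ones and ones needing manual review."""
    kept, flagged = [], []
    for t in tables:
        if not within_size_limits(t):
            continue  # outside the 3-30 x 3-6 window, dropped
        (flagged if has_missing_cells(t) else kept).append(t)
    return kept, flagged


# Invented toy inputs for illustration.
small = [["a", "b", "c"], ["1", "2", "3"]]                       # only 2 rows
good = [["h", "2019", "2018"], ["Revenue", "10", "8"], ["Cost", "4", "3"]]
gappy = [["h", "2019", "2018"], ["Revenue", "10", None], ["Cost", "4", "3"]]
wide = [[str(i) for i in range(7)] for _ in range(3)]            # 7 columns

kept, flagged = triage([small, good, gappy, wide])
```

Here `kept` holds only `good` and `flagged` holds only `gappy`; the other two tables fall outside the size window.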

Annotation stage

2.3 Quality Control TODO

Model Graph:

Result:

Thoughts:

Next Reading: