IM-TQA is a Chinese table question answering dataset with 1,200 tables and 5,000 question-answer pairs, which highlights implicit and multi-type table structures for real-world TQA scenarios. It yields a more challenging TQA setting with two characteristics: table structures are implicit (header cells are not explicitly provided to the model), and tables come in multiple types (vertical, horizontal, hierarchical, and complex).
By contrast, previous TQA benchmarks mainly focus on limited table types with explicit table structures (i.e., the model knows exactly which cells are headers). We collect multi-type tables and ask professional annotators to provide the following annotations: (1) table types, (2) header cell locations, (3) natural language look-up questions together with (4) their answer cell locations. More details, analyses, and baseline results can be found in the paper.
As shown in Figure 1, we divide tables into 4 types according to their structural characteristics, which is in line with previous work, with complex tables as an important complement. Exploring and including more table types deserves future investigation.
To promote the understanding of implicit table structures, we categorize table cells into 5 types based on their functional roles, with a focus on header cells, which are useful for TQA models to locate the correct answer cells.
In order to store various tables, we design a storage method that separately stores cell positions $P$ and cell contents $V$. To store cell positions, a cell ID is assigned to each table cell in row-first order. For a table with $m$ rows and $n$ columns, its cell IDs constitute an $m \times n$ matrix representing cell locations. This matrix contains table layout information such as neighbouring relations between different cells. As for cell contents, every cell value is put into a list in the same row-first order. An example format is shown in Figure 3. Given the cell ID matrix and cell value list, we instructed annotators in distinguishing the 5 cell types and asked them to annotate the cell ID lists of attribute and index cells; the remaining cells are deemed pure data cells. After identifying header cells, we asked annotators to write look-up questions about data cells and label the answer cell IDs.
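A minimal sketch of this storage scheme (header strings taken from the example below; the data-row values are made up for illustration):

# P: the cell ID matrix of a 2x3 table, with cell IDs assigned in row-first order.
cell_ID_matrix = [[0, 1, 2],
                  [3, 4, 5]]
# V: the cell contents in the same row-first order, so V[cell_id] recovers any cell's text.
cell_value_list = ["序号", "客户", "销售金额",   # header row (from the example below)
                   "1", "客户一", "100"]          # data row (made-up values)

def cell_value(row, col):
    # Look up a cell's content from its (row, col) position via its cell ID.
    return cell_value_list[cell_ID_matrix[row][col]]

print(cell_value(1, 2))  # -> "100"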
The IM-TQA dataset consists of six .json files for train/dev/test samples in the data directory. train_tables.json, dev_tables.json, and test_tables.json store table data and annotated header cells, while train_questions.json, dev_questions.json, and test_questions.json store question-answer pairs. Table samples and question-answer pairs are dictionary objects. Though IM-TQA is collected from Chinese tables, we adopt a commercial machine translation model to translate the tables and questions from Chinese into English. Note, however, that we did not double-check the translation results, so the translation quality may be poor.
Table sample format:
{
"table_id": "Z56mZoK9", # unique table id
"table_type": "vertical", # table type, possible table types: 'vertical', 'horizontal', 'hierarchical' or 'complex'.
"file_name": "垂直表格_216", # chinese table file name
"cell_ID_matrix": [[0,1,2,3], # cell_ID_matrix to store table layout information, which consists of several cell ID lists in the in the row-first order, e.g., [0,1,2,3] represents the first row.
[4,5,6,7]
,...,],
"chinese_cell_value_list": [ "序号", "客户", "销售金额", "年度销售占比%", "是否存在关联关系",...,], # cell_value_list to store cell content, which can be indexed by the cell ID in the cell_ID_matrix.
"english_cell_value_list": ["Serial No", "customer", "sales amount", "Proportion of annual sales%",...,], # cell_value_list translated into English.
"column_attribute": [0,1,2,3,4], # annotated cell ID list of different header cells.
"row_attribute": [],
"column_index": [],
"row_index": [5]
}
Question-answer pair sample format:
{
"table_id": "Z56mZoK9", # table_id is used to index the related table of each question.
"question_id": "Z56mZoK9_3", # unique question id
"file_name": "垂直表格_216", # chinese table file name
"chinese_question": "客户一的销售金额是多少?年度销售占比是多少?", # question text which is raised by annotators
"english_question": "What is the sales amount of Customer 1? What is the percentage of annual sales?", # english question text
"answer_cell_list": [7, 8], # cell id list of answer cells
"question_type": "arbitrary_cells" # question type, possible question types: 'single_cell', 'one_row', 'one_col' and 'arbitrary_cells'.
}
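A minimal sketch of reading these files and resolving a question's answer cells to their text (assuming each .json file stores a list of the dictionary objects shown above):

import json

# Paths assume the data/ directory layout described above.
with open("data/test_tables.json", encoding="utf-8") as f:
    tables = {t["table_id"]: t for t in json.load(f)}
with open("data/test_questions.json", encoding="utf-8") as f:
    questions = json.load(f)

q = questions[0]
table = tables[q["table_id"]]
# Answer cell IDs index directly into the cell value lists (row-first order).
answers = [table["chinese_cell_value_list"][cid] for cid in q["answer_cell_list"]]
print(q["chinese_question"], "->", answers)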
The dataset split statistics are shown below:

| Split | Train | Valid | Test | Total |
| --- | --- | --- | --- | --- |
| # tables | 936 | 111 | 153 | 1200 |
| # questions | 3909 | 464 | 627 | 5000 |
| # vertical tables | 224 | 31 | 45 | 300 |
| # horizontal tables | 230 | 34 | 36 | 300 |
| # hierarchical tables | 231 | 35 | 34 | 300 |
| # complex tables | 251 | 11 | 38 | 300 |
We evaluate traditional TQA methods and recent powerful large language models (LLMs) such as ChatGPT (the LLM output files are stored in the llm_outputs directory). From the results shown below, we find that ChatGPT performs quite well in handling look-up questions that select specific table cells as answers. This also demonstrates that more complicated questions are needed to present a comprehensive evaluation of LLMs' table understanding ability. Some recent studies have made progress towards this goal, e.g., [1], [2].
Exact match accuracy (%) of each model:

| Model | All Tables | Vertical | Horizontal | Hierarchical | Complex |
| --- | --- | --- | --- | --- | --- |
| Ernie-Layout | 11.6 | 11.5 | 4.10 | 5.66 | 22.6 |
| TAPEX | 13.1 | 14.9 | 10.7 | 8.18 | 17.4 |
| RAT | 18.5 | 34.5 | 33.6 | 5.03 | 4.07 |
| TAPAS | 33.2 | 58.0 | 31.1 | 26.4 | 15.7 |
| RCI | 47.2 | 68.4 | 45.1 | 56.0 | 19.2 |
| RCI-AIT | 49.6 | 69.5 | 43.4 | 60.4 | 23.8 |
| RGCN-RCI | 53.4 | 70.7 | 45.9 | 62.9 | 32.0 |
| ChatGPT (zero-shot) | 92.3 | 93.1 | 92.6 | 91.2 | 92.2 |
| Human | 95.1 | 96.6 | 95.1 | 94.3 | 94.1 |
We use PaddlePaddle to implement our model, and all experiments were conducted on an NVIDIA TITAN RTX 24GB GPU. We understand that configuring the experiment environment with PaddlePaddle may encounter some problems, and we suggest looking for solutions in the official PaddlePaddle GitHub Issues. The trained RGCN and RCI model weights can be downloaded from Google Drive.
conda create -n IM_TQA python=3.7
conda activate IM_TQA
pip install -r requirements.txt
The 'init_embedding_model' argument is the name of the model used to encode cell text into 768-dim semantic features. It is passed to model.from_pretrained(), and you can change the code to set it to the local path of your pre-downloaded model. The resulting PGL graph objects will be saved as pickle files (.pkl). A minimal sketch of the encoding step is shown after the command below.
cd CTC_code
python convert_tables_to_graphs.py \
--tables_dir='../data/' \
--saved_graphs_dir='../data/' \
--init_embedding_model='bert-base-chinese'
# or you can directly run: sh build_graphs_based_on_tables.sh
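For reference, the encoding step amounts to something like the following (an illustrative sketch only; it assumes the PaddleNLP 2.x API in which BertModel returns sequence and pooled outputs, while the repository's actual logic lives in convert_tables_to_graphs.py):

import paddle
from paddlenlp.transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

# Encode one cell's text into a 768-dim semantic feature (here: its [CLS] representation).
encoded = tokenizer("销售金额")
input_ids = paddle.to_tensor([encoded["input_ids"]])
with paddle.no_grad():
    sequence_output, pooled_output = model(input_ids)
cell_feature = sequence_output[:, 0]   # shape: [1, 768]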
The autoencoder converts the discrete 24-dim manual features into continuous 32-dim features; a minimal sketch of it follows the command below. The resulting 32-dim cell features of each table will also be saved as pickle files (.pkl).
CUDA_VISIBLE_DEVICES=0 nohup python train_auto_encoder.py \
--run_num=1 \
--enc_hidden_dim=32 \
--manual_feat_dim=24 \
--random_seed=12345 \
--data_dir='../data/' \
--feats_save_dir='../data/' \
--model_save_dir='./saved_models/ctc_auto_encoder/' > ./log_files/train_auto_encoder_to_encode_manual_cell_feats.log &
# or you can directly run: sh train_auto_encoder.sh
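Conceptually, the autoencoder looks like the sketch below (a minimal PaddlePaddle example; the layer sizes follow the flags above, but the activation and other hyperparameters are illustrative, not the repository's exact configuration):

import paddle
import paddle.nn as nn

class CellFeatAutoEncoder(nn.Layer):
    # Map 24-dim discrete manual cell features to 32-dim continuous features.
    def __init__(self, manual_feat_dim=24, enc_hidden_dim=32):
        super().__init__()
        self.encoder = nn.Linear(manual_feat_dim, enc_hidden_dim)
        self.decoder = nn.Linear(enc_hidden_dim, manual_feat_dim)

    def forward(self, x):
        z = paddle.tanh(self.encoder(x))   # 32-dim continuous cell features
        return self.decoder(z), z

model = CellFeatAutoEncoder()
opt = paddle.optimizer.Adam(learning_rate=1e-3, parameters=model.parameters())
loss_fn = nn.MSELoss()

x = paddle.rand([8, 24])                   # a toy batch of manual cell features
for _ in range(10):                        # reconstruction training loop
    recon, z = model(x)
    loss = loss_fn(recon, x)
    loss.backward()
    opt.step()
    opt.clear_grad()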
python3 add_manual_feats_to_table_graphs.py
Make sure the data paths in 'add_manual_feats_to_table_graphs.py' are correct. The resulting heterogeneous graphs, whose nodes carry both types of features (semantic and manual), will be saved as pickle files (.pkl).
This script trains an R-GCN model for the CTC task using the constructed heterogeneous graphs of the train split. It saves the best CTC model based on performance on the validation split, and the predicted CTC results for the tables of each split are saved for the subsequent table question answering (TQA) task. You can also save the model of each epoch and select the best model based on your own metric.
sh train_ctc_gnn.sh
The implementation of the TQA model is adapted from the codebase of the original RCI model, which uses PyTorch.
First, cd TQA_code and construct the row and column representations of the train and test splits using build_RCI_train_and_test_data.ipynb. Put the resulting files in TQA_code/datasets/IM_TQA/; they include 4 files (i.e., train_cols.jsonl.gz, train_rows.jsonl.gz, test_cols.jsonl.gz and test_rows.jsonl.gz).
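The exact serialization is defined in the notebook; below is a hypothetical sketch of what a row representation could look like (the "header : value" format and function name are assumptions, not the notebook's actual output):

def build_row_text(cell_ID_matrix, cell_value_list, row_id, header_row=0):
    # Serialize one table row as "header : value" pairs (illustrative format only).
    headers = [cell_value_list[cid] for cid in cell_ID_matrix[header_row]]
    values = [cell_value_list[cid] for cid in cell_ID_matrix[row_id]]
    return " ; ".join(f"{h} : {v}" for h, v in zip(headers, values))

# e.g. -> "序号 : 1 ; 客户 : 客户一 ; 销售金额 : 100"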
The training task is a 2-class sentence-pair classification task. Given a row or column representation and an input question, the bert-base-chinese model learns to predict whether this row or column contains the final answer cell(s); a minimal sketch of this formulation follows the training command below. The trained column and row models will be saved at ./datasets/IM_TQA/bert-base-chinese-epoch3-warmup0.1/col_bert_base and ./datasets/IM_TQA/bert-base-chinese-epoch3-warmup0.1/row_bert_base, respectively.
sh train_RCI_bert.sh
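A minimal sketch of the sentence-pair formulation with HuggingFace transformers (the repository's training script is the adapted RCI code; the snippet below only illustrates the idea):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

question = "客户一的销售金额是多少?"
row_text = "序号 : 1 ; 客户 : 客户一 ; 销售金额 : 100"   # a serialized row representation

# Encode the (question, row) pair; class 1 means the row contains an answer cell.
inputs = tokenizer(question, row_text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits        # shape: [1, 2]
row_contains_answer = logits.argmax(dim=-1).item() == 1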
In this step, the trained row and column models predict whether each row or column contains the answer cells. The inference results will be saved at ./datasets/IM_TQA/apply_bert/col_bert/results0.jsonl.gz and ./datasets/IM_TQA/apply_bert/row_bert/results0.jsonl.gz.
sh apply_RCI.sh
Based on the positive row IDs and column IDs, the predicted answer cell IDs are extracted (i.e., cell_ID_matrix[row_id][col_id]) and compared with the gold answer cell IDs to compute exact match scores; a simplified sketch of this step follows the command below. Make sure the related file paths in compute_RCI_exact_match.py are correct (lines 36-47). The predicted results of one run will be saved at ./datasets/IM_TQA/RGCN-RCI_test_pred_results.pkl.
python compute_RCI_exact_match.py
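Conceptually, the extraction and exact-match check amount to the following (a simplified sketch; the actual script also handles per-table-type breakdowns and file I/O):

def predict_answer_cells(cell_ID_matrix, pos_row_ids, pos_col_ids):
    # Cross the positive rows and columns to get predicted answer cell IDs.
    return {cell_ID_matrix[r][c] for r in pos_row_ids for c in pos_col_ids}

def exact_match(pred_cells, gold_cells):
    # A question counts as correct only if predicted and gold cell sets match exactly.
    return set(pred_cells) == set(gold_cells)

matrix = [[0, 1, 2, 3], [4, 5, 6, 7]]
pred = predict_answer_cells(matrix, pos_row_ids=[1], pos_col_ids=[2, 3])
print(pred)                       # {6, 7}
print(exact_match(pred, [6, 7]))  # True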
Since we have provided the Step 3 results of one experiment, you can directly run the above command to validate them. This should give:
(1) report on all tables:
total exact match score: 0.5311004784688995
correct question num: 333
total question num: 627
--------------------
(2) report on complex tables:
exact match score on complex tables: 0.3023255813953488
correct question num on complex tables: 52
total question num on complex tables: 172
--------------------
(3) report on vertical tables:
exact match score on vertical tables: 0.7126436781609196
correct question num on vertical tables: 124
total question num on vertical tables: 174
--------------------
(4) report on horizontal tables:
exact match score on horizontal tables: 0.45901639344262296
correct question num on horizontal tables: 56
total question num on horizontal tables: 122
--------------------
(5) report on hierarchical tables:
exact match score on hierarchical tables: 0.6352201257861635
correct question num on hierarchical tables: 101
total question num on hierarchical tables: 159
Though we made the first exploration towards real-life TQA scenarios with implicit and multi-type tables, this work still faces some limitations: the questions are restricted to look-up questions whose answers are specific table cells, only four table types are covered, and the English translations of tables and questions were not double-checked.
If you find this work useful, please consider citing our work:
@inproceedings{zheng-etal-2023-im,
title = "{IM}-{TQA}: A {C}hinese Table Question Answering Dataset with Implicit and Multi-type Table Structures",
author = "Zheng, Mingyu and
Hao, Yang and
Jiang, Wenbin and
Lin, Zheng and
Lyu, Yajuan and
She, QiaoQiao and
Wang, Weiping",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.278",
doi = "10.18653/v1/2023.acl-long.278",
pages = "5074--5094",
}
This dataset follows the Computational Use of Data Agreement v1.0.
Despite our best efforts, there may still be some errors in this dataset. If you have any questions regarding the IM-TQA dataset, please create an issue in this repository. You can also reach us via the e-mail addresses in the paper.