
Dataset and Code for the ACL 2023 paper: "IM-TQA: A Chinese Table Question Answering Dataset with Implicit and Multi-type Table Structures". We propose a new TQA problem aimed at real application scenarios, together with a supporting dataset and a baseline method.

IM-TQA: A Chinese Table Question Answering Dataset with Implicit and Multi-type Table Structures

Table of contents:

1. Dataset Description
2. Considered Table Types and Header Cells
3. Table Storage and Annotation
4. Sample Format
5. Leader Board
6. Model Training and Evaluation
7. Limitations

1. Dataset Description

IM-TQA is a Chinese table question answering dataset with 1,200 tables and 5,000 question-answer pairs, which highlights Implicit and Multi-type table structures for real-world TQA scenarios. It yields a more challenging table QA setting with two characteristics:

  1. models need to handle different types of tables. (i.e., Multi-type)
  2. header cell annotations are not provided to models directly. (i.e., Implicit)

By contrast, previous TQA benchmarks mainly focus on limited table types with explicit table structures (i.e., the model knows exactly which cells are headers). We collect multi-type tables and ask professional annotators to provide the following annotations: (1) table types, (2) header cell locations, (3) natural language look-up questions together with (4) their answer cell locations. More details, analyses, and baseline results can be found in the paper.

2. Considered Table Types and Header Cells

As shown in Figure 1, we divide tables into 4 types according to their structural characteristics, which is in line with previous works, with complex tables added as an important complement. Exploring and including more table types deserves future investigation.

To promote the understanding of implicit table structures, we categorize table cells into 5 types based on their functional roles, focusing on the header cells that are useful for TQA models to locate correct answer cells.

3. Table Storage and Annotation

In order to store various tables, we design a storage method which separately stores cell positions $P$ and cell contents $V$. To store cell positions, a cell ID is assigned to each table cell in row-first order. For a table with $m$ rows and $n$ columns, its cell IDs constitute an $m \times n$ matrix representing cell locations. This matrix contains table layout information such as neighbouring relations between different cells. As for cell contents, every cell value is put into a list in the same row-first order. An example format is shown in Figure 3. Given the cell ID matrix and cell value list, we instructed annotators on distinguishing the 5 cell types and asked them to annotate the cell ID lists of attribute and index cells; the remaining table cells are deemed pure data cells. After the header cells were identified, we asked annotators to raise look-up questions about data cells and to label the answer cell IDs.
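
To make the storage scheme concrete, here is a minimal Python sketch with a toy 2×3 regular table (the table values and function name are invented for illustration only):

# Toy example: a 2 x 3 table stored as a cell ID matrix (positions P)
# plus a cell value list (contents V), both in row-first order.
m, n = 2, 3
cell_ID_matrix = [[r * n + c for c in range(n)] for r in range(m)]
# -> [[0, 1, 2], [3, 4, 5]]
cell_value_list = ["Name", "Age", "City", "Alice", "30", "Beijing"]

def cell_content(row, col):
    # A cell ID doubles as the index into the cell value list.
    return cell_value_list[cell_ID_matrix[row][col]]

assert cell_content(1, 2) == "Beijing"

Note that in tables with merged cells, a single cell ID can presumably occupy several positions in the matrix, so the matrix is stored explicitly rather than derived from a formula as in this toy case.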

4. Sample Format

The IM-TQA dataset consists of six .json files for train/dev/test samples in the data directory: train_tables.json, dev_tables.json, and test_tables.json store table data and annotated header cells, while train_questions.json, dev_questions.json, and test_questions.json store question-answer pairs. Table samples and question-answer pairs are dictionary objects. Though IM-TQA is collected from Chinese tables, we used a commercial machine translation model to translate the tables and questions from Chinese into English. Note that we did not double-check the translation results, so the translation quality may be poor.

Table sample format:

{
  "table_id": "Z56mZoK9",           # unique table id
  "table_type": "vertical",         # table type, possible types: 'vertical', 'horizontal', 'hierarchical' or 'complex'
  "file_name": "垂直表格_216",       # chinese table file name
  "cell_ID_matrix": [[0,1,2,3],     # cell ID matrix storing table layout information; each inner list holds the cell IDs of one row in row-first order, e.g., [0,1,2,3] is the first row
                     [4,5,6,7],
                     ...],
  "chinese_cell_value_list": ["序号", "客户", "销售金额", "年度销售占比%", "是否存在关联关系", ...],  # cell value list storing cell contents, indexed by the cell IDs in cell_ID_matrix
  "english_cell_value_list": ["Serial No", "customer", "sales amount", "Proportion of annual sales%", ...],  # the cell value list translated into English
  "column_attribute": [0,1,2,3,4],  # the next four fields are the annotated cell ID lists of the different header cell types
  "row_attribute": [],
  "column_index": [],
  "row_index": [5]
}

Question-answer pair sample format:

{
    "table_id": "Z56mZoK9",  # table_id is used to index the related table of each question.
    "question_id": "Z56mZoK9_3", # unique question id
    "file_name": "垂直表格_216", # chinese table file name
    "chinese_question": "客户一的销售金额是多少?年度销售占比是多少?", # question text which is raised by annotators
    "english_question": "What is the sales amount of Customer 1? What is the percentage of annual sales?", # english question text
    "answer_cell_list": [7, 8], # cell id list of answer cells
    "question_type": "arbitrary_cells"  # question type, possible question types: 'single_cell', 'one_row', 'one_col' and 'arbitrary_cells'.
}
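
As a quick usage sketch, the following joins a question with its table and resolves the gold answer cells to cell text (assuming each .json file stores a list of the dictionary objects shown above; paths follow the data directory layout described in this section):

import json

# Load tables and questions of the test split.
with open("data/test_tables.json", encoding="utf-8") as f:
    tables = {t["table_id"]: t for t in json.load(f)}
with open("data/test_questions.json", encoding="utf-8") as f:
    questions = json.load(f)

q = questions[0]
table = tables[q["table_id"]]
# Answer cell IDs index directly into the cell value list.
gold_answers = [table["chinese_cell_value_list"][cid] for cid in q["answer_cell_list"]]
print(q["english_question"], "->", gold_answers)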
The dataset split statistics are shown below:

                        Train   Valid   Test   Total
# tables                  936     111    153    1200
# questions              3909     464    627    5000
# vertical tables         224      31     45     300
# horizontal tables       230      34     36     300
# hierarchical tables     231      35     34     300
# complex tables          251      11     38     300

5. Leader Board

We evaluate traditional TQA methods and recent powerful large language models (LLMs) like ChatGPT. (The LLMs' output files are stored in the llm_outputs directory.) From the results shown below, we can see that ChatGPT performs very well in handling look-up questions which select specific table cells as answers. This also demonstrates that more complicated questions are needed for a comprehensive evaluation of LLMs' table understanding ability. Some recent studies have made valid progress towards this goal, e.g., [1], [2].

All numbers are Exact Match Acc (%):

Model                  All Tables   Vertical   Horizontal   Hierarchical   Complex
Ernie-Layout                 11.6       11.5         4.10           5.66      22.6
Tapex                        13.1       14.9         10.7           8.18      17.4
RAT                          18.5       34.5         33.6           5.03      4.07
TAPAS                        33.2       58.0         31.1           26.4      15.7
RCI                          47.2       68.4         45.1           56.0      19.2
RCI-AIT                      49.6       69.5         43.4           60.4      23.8
RGCN-RCI                     53.4       70.7         45.9           62.9      32.0
ChatGPT (zero-shot)          92.3       93.1         92.6           91.2      92.2
Human                        95.1       96.6         95.1           94.3      94.1

6. Model Training and Evaluation

6.1 Environment Setup and Model Weights

We use PaddlePaddle to implement our model, and all experiments were conducted on an NVIDIA TITAN RTX 24GB GPU. Configuring the experiment environment with PaddlePaddle may run into problems; we suggest looking for solutions in the official PaddlePaddle GitHub issues. The trained RGCN and RCI model weights can be downloaded from Google Drive.

conda create -n IM_TQA python=3.7
conda activate IM_TQA
pip install -r requirements.txt

6.2 RGCN for Cell Type Classification (CTC)

Step 1: Convert Tables into Heterogeneous Graphs in PGL

The 'init_embedding_model' argument is the name of the model used to encode cell text into 768-dim semantic features. It is passed to model.from_pretrained(), and you can change the code to set it to the local path of a pre-downloaded model. The resulting PGL graph objects will be saved as pickle files (.pkl).

cd CTC_code
python convert_tables_to_graphs.py \
--tables_dir='../data/' \
--saved_graphs_dir='../data/' \
--init_embedding_model='bert-base-chinese'
# or you can directly run: sh build_graphs_based_on_tables.sh 
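
For reference, the cell-encoding step behind 'init_embedding_model' amounts to mapping every cell text to a 768-dim vector with a pretrained encoder such as bert-base-chinese. The sketch below uses HuggingFace transformers purely for illustration; the repository itself relies on PaddlePaddle/PaddleNLP, and the pooling choice ([CLS] vector) is an assumption:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")
encoder.eval()

def encode_cells(cell_value_list):
    # Encode each cell text into a 768-dim semantic feature vector.
    feats = []
    with torch.no_grad():
        for text in cell_value_list:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=32)
            outputs = encoder(**inputs)
            feats.append(outputs.last_hidden_state[:, 0, :].squeeze(0))  # [CLS] vector
    return torch.stack(feats)  # shape: (num_cells, 768)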

Step 2: Train an AutoEncoder

The autoencoder converts the discrete 24-dim manual features into continuous 32-dim features. The resulting 32-dim cell features of each table will also be saved as pickle files (.pkl).

CUDA_VISIBLE_DEVICES=0 nohup python train_auto_encoder.py \
--run_num=1 \
--enc_hidden_dim=32 \
--manual_feat_dim=24 \
--random_seed=12345 \
--data_dir='../data/' \
--feats_save_dir='../data/' \
--model_save_dir='./saved_models/ctc_auto_encoder/' > ./log_files/train_auto_encoder_to_encode_manual_cell_feats.log &
# or you can directly run: sh train_auto_encoder.sh
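
Conceptually the autoencoder is just a small network that maps the 24-dim discrete manual features to 32-dim continuous codes and is trained to reconstruct its input. A minimal PyTorch sketch of this idea (the repository implementation uses PaddlePaddle; the activation choice and training details here are assumptions):

import torch
import torch.nn as nn

class CellFeatAutoEncoder(nn.Module):
    """Map 24-dim discrete manual cell features to 32-dim continuous features."""
    def __init__(self, manual_feat_dim=24, enc_hidden_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(manual_feat_dim, enc_hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(enc_hidden_dim, manual_feat_dim)

    def forward(self, x):
        z = self.encoder(x)      # 32-dim continuous cell features kept for the CTC graphs
        recon = self.decoder(z)  # reconstruction used only for the training loss
        return z, recon

model = CellFeatAutoEncoder()
criterion = nn.MSELoss()
x = torch.rand(16, 24)           # a toy batch of 24-dim manual features
z, recon = model(x)
loss = criterion(recon, x)       # reconstruction objective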

Step 3: Include 32-dim features to existing graphs to obtain final graphs

python3 add_manual_feats_to_table_graphs.py

Make sure the data paths in 'add_manual_feats_to_table_graphs.py' are correct; the resulting heterogeneous graphs with both types of node features will be saved as pickle files (.pkl).

Step 4: Train an R-GCN model for CTC task

This script trains an R-GCN model for the CTC task using the constructed heterogeneous graphs of the train split. It saves the best CTC model based on performance on the validation split, and the predicted CTC results for the tables of each split are saved for the subsequent table question answering (TQA) task. You can also save the model of each epoch and select the best model based on your own metric.

sh train_ctc_gnn.sh

6.3 RCI for Table Question Answering (TQA)

The implementation of the TQA model is adapted from the codebase of the original RCI model, which uses PyTorch.

Step 1: Construct row and column representations with header contents

First, cd into TQA_code and construct row and column representations of the train and test splits using build_RCI_train_and_test_data.ipynb. Put the resulting files in TQA_code/datasets/IM_TQA/; there are 4 of them (i.e., train_cols.jsonl.gz, train_rows.jsonl.gz, test_cols.jsonl.gz and test_rows.jsonl.gz).
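
For intuition, a row/column representation is roughly the concatenation of the cells in that row or column, with the annotated header cells made visible to the model. A simplified sketch assuming the table format from Section 4 (the exact serialization used in build_RCI_train_and_test_data.ipynb may differ):

def build_row_and_col_texts(table):
    # Serialize each row and each column as text, putting header cells first
    # so the classifier sees header context before data cells.
    matrix = table["cell_ID_matrix"]
    values = table["chinese_cell_value_list"]
    header_ids = set(table["column_attribute"] + table["row_attribute"]
                     + table["column_index"] + table["row_index"])

    def serialize(cell_ids):
        ordered = sorted(set(cell_ids), key=lambda cid: (cid not in header_ids, cid))
        return " | ".join(values[cid] for cid in ordered)

    row_texts = [serialize(row) for row in matrix]
    col_texts = [serialize(col) for col in zip(*matrix)]
    return row_texts, col_texts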

Step 2: Train the RCI row and column model

The training task is a 2-class sentence-pair classification task. Given a row or column representation and an input question, the bert-base-chinese model learns to predict whether this row or column contains the final answer cell(s). The trained models will be saved at ./datasets/IM_TQA/bert-base-chinese-epoch3-warmup0.1/col_bert_base and ./datasets/IM_TQA/bert-base-chinese-epoch3-warmup0.1/row_bert_base, respectively.

sh train_RCI_bert.sh 
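
Each training example is therefore a (question, row-or-column text) pair with a binary label. A minimal sketch of one training step with HuggingFace transformers (the repository uses its own adapted RCI codebase, so this snippet is illustrative and the toy row text is invented):

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

question = "What is the sales amount of Customer 1?"
row_text = "Serial No | customer | sales amount | 1 | Customer 1 | 53467"  # toy row representation
label = torch.tensor([1])  # 1: this row contains the answer cell(s), 0: it does not

inputs = tokenizer(question, row_text, return_tensors="pt", truncation=True, max_length=256)
outputs = model(**inputs, labels=label)
loss, logits = outputs.loss, outputs.logits  # cross-entropy loss over the 2 classes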

Step 3: Apply the RCI row and column models to the test split

In this step, the trained row and column models predict whether each row or column contains the answer cells. The inference results will be saved at ./datasets/IM_TQA/apply_bert/col_bert/results0.jsonl.gz and ./datasets/IM_TQA/apply_bert/row_bert/results0.jsonl.gz.

sh apply_RCI.sh 

Step 4: Compute exact match score

Based on the positive row IDs and column IDs, the predicted answer cell IDs are extracted (i.e., cell_ID_matrix[row_id][col_id]) and compared with the gold answer cell IDs to compute exact match scores. Make sure the related file paths in compute_RCI_exact_match.py are correct (lines 36-47). The predicted results of one run will be saved at ./datasets/IM_TQA/RGCN-RCI_test_pred_results.pkl.
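
The core of this scoring step can be sketched as follows (compute_RCI_exact_match.py additionally handles per-table-type reporting; the helper names below are invented):

def predict_answer_cells(cell_ID_matrix, positive_row_ids, positive_col_ids):
    # Every (positive row, positive column) intersection is a predicted answer cell.
    return {cell_ID_matrix[r][c] for r in positive_row_ids for c in positive_col_ids}

def exact_match(pred_cells, gold_cells):
    # Exact match requires the predicted cell set to equal the gold answer cell set.
    return int(set(pred_cells) == set(gold_cells))

# Toy check with the sample from Section 4: a 2 x 5 table, answer cells [7, 8].
matrix = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(exact_match(predict_answer_cells(matrix, [1], [2, 3]), [7, 8]))  # -> 1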

python compute_RCI_exact_match.py

Since we have provided results of Step 3 of one experiment, you can directly run the above command to validate its results. This should give:

(1) report on all tables:
total exact match score:  0.5311004784688995
correct question num:  333
total question num: 627
--------------------
(2) report on complex tables:
exact match score on complex tables: 0.3023255813953488
correct question num on complex tables: 52
total question num on complex tables: 172
--------------------
(3) report on vertical tables:
exact match score on vertical tables: 0.7126436781609196
correct question num on vertical tables: 124
total question num on vertical tables: 174
--------------------
(4) report on horizontal tables:
exact match score on horizontal tables: 0.45901639344262296
correct question num on horizontal tables: 56
total question num on horizontal tables: 122
--------------------
(5) report on hierarchical tables:
exact match score on hierarchical tables: 0.6352201257861635
correct question num on hierarchical tables: 101
total question num on hierarchical tables: 159

7. Limitations

Though we made a first exploration towards real-life TQA scenarios with implicit and multi-type tables, this work still faces some limitations, which are discussed in the Limitations section of the paper.

Reference

If you find this work useful, please consider citing our work:

@inproceedings{zheng-etal-2023-im,
    title = "{IM}-{TQA}: A {C}hinese Table Question Answering Dataset with Implicit and Multi-type Table Structures",
    author = "Zheng, Mingyu  and
      Hao, Yang  and
      Jiang, Wenbin  and
      Lin, Zheng  and
      Lyu, Yajuan  and
      She, QiaoQiao  and
      Wang, Weiping",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.278",
    doi = "10.18653/v1/2023.acl-long.278",
    pages = "5074--5094",
}

License

This dataset follows the Computational Use of Data Agreement v1.0.

Contact

Despite our best efforts, there may still be some errors in this dataset. If you have any questions regarding the IM-TQA dataset, please create an issue in this repository. You can also reach us via the e-mail addresses in the paper.