
Dataset and Code for our ACL 2024 paper: "Multimodal Table Understanding". We propose the first large-scale Multimodal IFT and Pre-Train Dataset for table understanding and develop a generalist tabular MLLM named Table-LLaVA.

Multimodal-Table-Understanding


1. Introduction

Although great progress has been made by recent LLM-based table understanding methods, they rely heavily on the premise that given tables must be converted into a certain text sequence (such as Markdown or HTML) to serve as model input. However, it is difficult to obtain high-quality textual table representations in some real-world scenarios such as scanned documents and webpage screenshots, where table images are much more accessible. Therefore, directly understanding tables from intuitive visual information is a crucial and urgent challenge for developing more practical applications.

Facing the above challenge, we propose the multimodal table understanding problem, where the model is required to generate correct responses to different table-related requests (e.g., questions) in an end-to-end fashion based on the table image. Correspondingly, we construct MMTab, the first open-source large-scale dataset for the multimodal table understanding problem, which supports both training and evaluation of generalist MLLMs for multimodal table understanding. Based on the curated MMTab dataset, we develop a versatile tabular MLLM named Table-LLaVA with an enhanced two-stage training paradigm of LLaVA v1.5. Table-LLaVA beats strong MLLM baselines on 17 held-in and 6 held-out benchmarks, and is even competitive with the powerful GPT-4V on 14 benchmarks under a subset of test samples. The right figure shows an intuitive comparison of Table-LLaVA 7B and existing MLLMs on various multimodal table understanding benchmarks.

2. Dataset Description

We constructed MMTab based on 14 publicly available table datasets covering 8 domains. We carefully designed scripts to convert the original textual tables in these datasets into table images with broad coverage of table structures and styles, and transformed all task-specific samples into multimodal instruction-tuning samples with a unified format of <table image, input request, output response>. The resulting dataset contains three parts and can be downloaded from the Hugging Face Dataset. During dataset construction, data augmentations at multiple levels (e.g., table-level, task-level) were adopted to further improve data diversity.

| Dataset Split | #Table Images | #Samples |
| :--- | :---: | :--- |
| MMTab-pre | 97K | 150K table recognition samples for pre-training |
| MMTab-instruct | 82K | 232K samples of 14 table-based tasks for instruction-tuning |
| MMTab-eval | 23K | 45K samples of 17 held-in benchmarks and 4K samples of 7 held-out benchmarks for evaluation |

Dataset examples are shown in the following figure, and more examples are provided in Appendix A of the original paper.

3. Model Weights

Table LLaVA follows the LLaVA v1.5 architecture, with CLIP-ViT-L-336px as the visual encoder (336Γ—336 image resolution), Vicuna-v1.5-7B or Vicuna-v1.5-13B as the base LLM, and a two-layer MLP as the vision-language connector. The saved model checkpoints can be downloaded from the following Hugging Face repositories:

| Version | Size | Schedule | Base LLM | Vision Encoder | Projection Layer | Checkpoints |
| :--- | :---: | :---: | :---: | :---: | :---: | :--- |
| Table LLaVA | 7B | full_finetune-1_epoch | Vicuna-v1.5-7B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-7b |
| Table LLaVA | 13B | full_finetune-1_epoch | Vicuna-v1.5-13B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-13b |
| pretrained_mm_projector of Table LLaVA 7B | 5M | full_finetune-1_epoch | Vicuna-v1.5-7B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-pretrained_mm_projector |
| pretrained_mm_projector of Table LLaVA 13B | 5M | full_finetune-1_epoch | Vicuna-v1.5-13B | CLIP-ViT-L-336px | MLP-2x | SpursgoZmy/table-llava-v1.5-pretrained_mm_projector |

Note: The above Table-LLaVA checkpoints were saved with the original LLaVA repository and are not directly compatible with Transformers, i.e., they cannot be loaded directly with LlavaForConditionalGeneration.from_pretrained('SpursgoZmy/table-llava-v1.5-7b'). This problem is mentioned in this github issue. I will try the provided conversion script to make the Table-LLaVA checkpoints compatible with Transformers and upload the converted checkpoints to a new hub. For now, the checkpoints can only be loaded with the LLaVA repository, as shown below, instead of being loaded directly from Hugging Face. Sorry for the inconvenience!
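
Under the Section 4.1 environment, a minimal loading sketch could look like the following; the builder call follows the standard LLaVA v1.5 API, so treat this as a sketch rather than the official loading snippet:

    # Minimal sketch: load Table-LLaVA with the LLaVA v1.5 code base instead of Transformers.
    # Assumes the hub id 'SpursgoZmy/table-llava-v1.5-7b' can be resolved (or replace it with a
    # local path to the downloaded checkpoint).
    from llava.model.builder import load_pretrained_model
    from llava.mm_utils import get_model_name_from_path

    model_path = "SpursgoZmy/table-llava-v1.5-7b"
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path),
    )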

4. Training

4.1 Environment Setup

We use the code base of LLaVA v1.5 for model training and inference, so Table LLaVA can be used like a normal LLaVA v1.5 model and the environment can be installed in a similar way. Note that our code base was downloaded in December 2023 and may not be the latest version. Please refer to the official LLaVA v1.5 github repository for the latest updates.

  1. Clone this repository and navigate to Table-LLaVA folder

    git clone https://github.com/SpursGoZmy/Table-LLaVA.git
    cd Table-LLaVA
  2. Install Package

    conda create -n table_llava python=3.10 -y
    conda activate table_llava
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
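
After installation, an optional sanity check (a minimal sketch, assuming the table_llava environment is active) is to confirm that the editable install resolves to the cloned repository:

    # Optional sanity check: after 'pip install -e .' the 'llava' package should
    # resolve to the cloned Table-LLaVA repository.
    import llava
    print(llava.__file__)  # should point into .../Table-LLaVA/llava/__init__.py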

4.2 Training Data and Hyperparameters

Table LLaVA training consists of two stages: (1) Pre-training stage: the vision-language connector (a two-layer MLP) is trained to connect the frozen pretrained vision encoder (ViT) to the frozen LLM (Vicuna v1.5); (2) Instruction-tuning stage: the vision-language connector and the base LLM are trained to follow multimodal instructions.

The training data of each stage is shown below:

| Training Stage | Data Description | Data Size | Hugging Face Dataset |
| :--- | :--- | :---: | :--- |
| Pre-training | 558K original LLaVA-1.5 pre-training data | 558K | blip_laion_cc_sbu_558k.json |
| Pre-training | 150K table recognition data (MMTab-pre) | 150K | MMTab-pre_pretrain_data_llava_format_150K.json |
| Instruction Fine-tuning | 665K original LLaVA-1.5 fine-tuning data | 665K | llava_v1_5_mix665k.json |
| Instruction Fine-tuning | 232K multimodal instruction-tuning data of 14 tabular tasks (MMTab-instruct) | 232K | MMTab-instruct_sft_data_llava_format_232K.json |

The merged pre-training and instruction fine-tuning data in the LLaVA data format can be found in the MMTab dataset, i.e., enhanced_llava_pretrain_data_708K.json and enhanced_llava_sft_data_898K.json, which can be directly used to train Table LLaVA.
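
If you want to inspect the merged training data before launching training, a small sketch like the one below can help; the id / image / conversations layout follows the standard LLaVA data format, and the file is assumed to sit in the current directory:

    # Minimal sketch: peek at one sample of the merged instruction-tuning data.
    # The id / image / conversations layout is the standard LLaVA data format.
    import json

    with open("enhanced_llava_sft_data_898K.json", "r", encoding="utf-8") as f:
        samples = json.load(f)  # a list of dicts in LLaVA conversation format

    sample = samples[0]
    print(sample["id"])                   # sample id
    print(sample["image"])                # image path relative to ./LLaVA-Finetune/images
    for turn in sample["conversations"]:  # alternating "human" / "gpt" turns
        print(turn["from"], ":", turn["value"][:80])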

Table LLaVA was trained on 8 A800 GPUs with 80GB memory. We use a similar set of hyperparameters to LLaVA v1.5, except that we increased the max sequence length from 2048 to 2560 to accommodate longer text sequences. The hyperparameters used in pre-training and fine-tuning are provided below.

| Stage | Trained Weights | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay | Warmup Ratio | DeepSpeed Stage |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Pre-training | vision-language connector | 256 | 1e-3 | 1 | 2560 | 0 | 0.03 | ZeRO-2 |
| Instruction Fine-tuning | base LLM and vision-language connector | 128 | 2e-5 | 1 | 2048 | 0 | 0.03 | ZeRO-3 |

4.3 Pre-training

  1. Download the original images for LLaVA v1.5 pretraining, i.e., images.zip from here. Put it under ./LLaVA-Pretrain/images and unzip it.
  2. Download MMTab-instruct_table_images_82K.zip and MMTab-pre_table_images_part_2_16K.zip from MMTab dataset. Put them under ./LLaVA-Pretrain/images and unzip them. Rename the IID_train_image dir to table_pretrain_part_1.
  3. Download enhanced_llava_pretrain_data_708K.json from MMTab dataset to ./LLaVA-Pretrain.
  4. The resulting data should be organized as follows:
LLaVA-Pretrain
β”œβ”€β”€ images
β”‚   β”œβ”€β”€ table_pretrain_part_1
β”‚   β”œβ”€β”€ table_pretrain_part_2
β”‚   β”œβ”€β”€ 00453
β”‚   β”œβ”€β”€ 00019
β”‚   β”œβ”€β”€ ...
β”‚   └── 00095
└── enhanced_llava_pretrain_data_708K.json
  5. Training script with DeepSpeed ZeRO-2: pretrain_table_llava.sh. If you cannot automatically download the base Vicuna v1.5 and ViT models through Hugging Face, you can download them manually and set the corresponding command-line parameters (model_name_or_path and vision_tower) to the local model paths. Once the pre-training is finished, the trained vision-language projector will be saved at the specified output_dir.

4.4 Fine-tuning

  1. Create 5 new folders under ./LLaVA-Finetune/images whose names are coco, gqa, ocr_vqa, textvqa and vg, respectively. Follow instructions from here to download images from these 5 datasets for LLaVA v1.5 fine-tuning. Put the zip files in the corresponding folders and unzip them.
  2. Download MMTab-instruct_table_images_82K.zip from MMTab dataset. Put it under ./LLaVA-Finetune/images/table_instructV and unzip it. Rename the resulting IID_train_image dir to images.
  3. Download enhanced_llava_sft_data_898K.json from MMTab dataset to ./LLaVA-Finetune.
  4. The resulting data should be organized as follows:
LLaVA-Finetune
β”œβ”€β”€ images
β”‚   β”œβ”€β”€ coco
β”‚   β”‚   └── train2017
β”‚   β”œβ”€β”€ gqa
β”‚   β”‚   └── images
β”‚   β”œβ”€β”€ ocr_vqa
β”‚   β”‚   └── images
β”‚   β”œβ”€β”€ textvqa
β”‚   β”‚   └── train_images
β”‚   β”œβ”€β”€ vg
β”‚   β”‚   β”œβ”€β”€ VG_100K
β”‚   β”‚   └── VG_100K_2
β”‚   β”œβ”€β”€ table_instructV
β”‚   β”‚   └── images
└── enhanced_llava_sft_data_898K.json
  5. Training script with DeepSpeed ZeRO-3: continue_sft_table_llava.sh. Set the pretrain_mm_mlp_adapter parameter to the path of your pre-trained vision-language projector, such as ./pretrained_mm_projector/llava-v1.5-7b-with-table-pretrain/mm_projector.bin. The trained Table LLaVA model will be saved at the specified output_dir.

5. Inference

The inference data should be stored in LLaVA's jsonl format. Each line in the input file corresponds to one input sample, which is a JSON string (generated by json.dumps()) of a Python dict. The sample format should look like this:

{     "question_id": "TSD_test_item_17", # item_id
      "image": "TABMWP_24663.jpg", # corresponding image file
      "text": "This image displays a table. Could you provide me ...", # input text
      "category": "TABMWP_for_TSD" # {dataset_name}_for_{task_type}, which can be used to separate data of different benchmarks.
}
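
For custom data, a small sketch for producing such a jsonl file could look like this (file names and field values are illustrative):

    # Minimal sketch: write custom inference samples in LLaVA's jsonl format,
    # one json.dumps() string per line, following the layout shown above.
    import json

    items = [
        {
            "question_id": "my_item_0",   # unique item id
            "image": "my_table.jpg",      # image file under the 'image-folder' path
            "text": "This image displays a table. Please answer the question ...",
            "category": "MyData_for_TQA", # {dataset_name}_for_{task_type}
        },
    ]

    with open("my_test_data_llava_jsonl_format.jsonl", "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")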

For inference on MMTab-eval, download the 49K MMTab-eval test samples in jsonl format (MMTab-eval_test_data_49K_llava_jsonl_format.jsonl) and the corresponding image files (MMTab-eval_table_images_23K.zip). Then create a folder named 'LLaVA-Inference' and organize the data as follows:

LLaVA-Inference
β”œβ”€β”€ MMTab-eval_test_data_49K_llava_jsonl_format.jsonl
└── all_test_image

Inference on multiple GPUs: start_multicard_inference.sh. You can also run inference on your own data. Remember to adjust parameters like 'question-file' (input file path) and 'image-folder' (image folder path) in table_llava_inference.sh. The inference results (merge.jsonl) will be stored at the path given by the 'answers-file' parameter, e.g., ./eval_results/answers/MMTab_eval/table-llava-v1.5-7b/merge.jsonl.

With the official inference script, the inference result format in merge.jsonl should look like this:

{
    'question_id': 'TABMWP_8',  # item_id
    'prompt': 'Problem: \nHannah baked cookies each day ...',  # input_prompt
    'text': 'Find the numbers in the table.\n\nSaturday: ...',  # model_output
    'answer_id': 'jELcxSPcXHBj3xvHfm5r8T',  # answer_id
    'model_id': 'table-llava-7b',  # model_id
    'category': 'TABMWP_for_TQA'  # item category
}
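
A small sketch for loading merge.jsonl and grouping model outputs by benchmark (the result path is illustrative) might look like:

    # Minimal sketch: group inference results by 'category' before evaluation.
    import json
    from collections import defaultdict

    predictions = defaultdict(list)
    with open("eval_results/answers/MMTab_eval/table-llava-v1.5-7b/merge.jsonl", encoding="utf-8") as f:
        for line in f:
            result = json.loads(line)
            predictions[result["category"]].append(
                {"question_id": result["question_id"], "output": result["text"]}
            )

    for category, outputs in predictions.items():
        print(category, len(outputs))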

6. Evaluation

The evaluation scripts are stored in the MMTab-eval_evaluation folder. First, cd MMTab-eval_evaluation and run pip install -r eval_requirements.txt to install the necessary packages, such as sacrebleu. For the table recognition task, we use PubTabNet's TEDS computation script for evaluation. Then, download the MMTab-eval test data (MMTab-eval_test_data_49K.json) and test tables (MMTab-eval_test_tables_23K.json), and put them into the MMTab-eval_evaluation folder together with LLaVA's inference result (merge.jsonl). Use the MMTab_evaluation.ipynb notebook for automatic evaluation.
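
As an illustration only (the official metrics are computed in MMTab_evaluation.ipynb), a corpus-level BLEU score for a text-generation style task can be computed with sacrebleu like this:

    # Illustrative sketch, not the official evaluation logic: corpus BLEU with sacrebleu.
    import sacrebleu

    model_outputs = ["hannah baked 4 cookies on saturday"]  # hypothetical model predictions
    references = [["hannah baked 4 cookies on saturday"]]   # one stream of reference strings
    print(sacrebleu.corpus_bleu(model_outputs, references).score)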

For evaluation on the ToTTo test set, you need to organize the model outputs into a txt file and upload it to the official ToTTo leaderboard.

7. Limitations and Future Directions

  1. Multilingual and multi-table scenarios. The proposed MMTab dataset mainly focuses on single tables in English. Multi-table scenarios with broader language coverage should be considered in the future.
  2. Table images in the wild. MMTab is based on tables from academic table datasets, and it contains diverse, high-quality table images rendered by automatic scripts. Nevertheless, table images in the wild can be of low quality, for instance blurred, handwritten, or incomplete table images. To further bridge the gap between academic research and real application scenarios, more diversified table images from the wild could be collected in the future, and their corresponding instruction-following data would need to be constructed.
  3. Improving image resolution. The image resolution supported by LLaVA-1.5 is relatively low, which may limit the upper bound of its capacity. Fortunately, with the emergence of MLLMs that support higher and dynamic image resolutions (e.g., LLaVA-Next and Qwen-VL), more powerful tabular MLLMs can be developed with the collected data.

TODOs

Citation

@misc{zheng2024multimodal,
      title={Multimodal Table Understanding},
      author={Mingyu Zheng and Xinwei Feng and Qingyi Si and Qiaoqiao She and Zheng Lin and Wenbin Jiang and Weiping Wang},
      year={2024},
      eprint={2406.08100},
      archivePrefix={arXiv},
}