ZephyrZhuQi / ssbaseline

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps[AAAI2021]


Here is the code for the ssbaseline model. We also provide OCR results, features, and models. The code is built on top of M4C, whose README contains more detailed information.


If you use ssbaseline in your work, please cite:

  title={Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps},
  author={Zhu, Qi and Gao, Chenyu and Wang, Peng and Wu, Qi},
  journal={arXiv preprint arXiv:2012.05153},


First install the repo using

git clone https://github.com/ZephyrZhuQi/ssbaseline.git ~/ssbaseline
cd ~/ssbaseline
python setup.py build develop

Getting Data

We provide SBD-Trans OCR for TextVQA and ST-VQA datasets. The corresponding OCR Faster R-CNN features and Recog-CNN features are also released.

| Datasets | ImDBs | Object Faster R-CNN Features | OCR Faster R-CNN Features | OCR Recog-CNN Features |
|---|---|---|---|---|
| TextVQA | TextVQA ImDB | Open Images | TextVQA SBD-Trans OCRs | TextVQA SBD-Trans OCRs |

Pretrained Models

We release the following pretrained model for ssbaseline on TextVQA: ssbaseline trained with SBD-Trans OCRs and with ST-VQA as additional data (our best model).

| Datasets | Config Files (under configs/vqa/) | Pretrained Models | Metrics | Notes |
|---|---|---|---|---|
| TextVQA (m4c_textvqa) | m4c_textvqa/m4c_sbd.yml (needs modification: add the ST-VQA imdb and feature files; see m4c_with_stvqa.yml for reference) | ssbaseline_with_stvqa | val accuracy: 45.53%; test accuracy: 45.66% | SBD-Trans OCRs; ST-VQA as additional data |
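
The note on m4c_sbd.yml asks for the ST-VQA imdb and feature files to be added next to the TextVQA ones. A minimal sketch of that merge on an in-memory config is given here. The key layout (`imdb_files`, `image_features`, each with a `train` list), the helper `add_training_data`, and every path are assumptions made for illustration; m4c_with_stvqa.yml remains the reference for the real entries.

```python
import copy


def add_training_data(dataset_cfg, imdb_path, feature_path):
    """Return a copy of dataset_cfg with one extra training imdb/feature pair.

    The key names used here are assumed for illustration, not read from the repo.
    """
    merged = copy.deepcopy(dataset_cfg)  # leave the caller's config untouched
    merged.setdefault("imdb_files", {}).setdefault("train", []).append(imdb_path)
    merged.setdefault("image_features", {}).setdefault("train", []).append(feature_path)
    return merged


# Hypothetical paths standing in for the real TextVQA and ST-VQA files.
textvqa_cfg = {
    "imdb_files": {"train": ["imdb/textvqa_train.npy"]},
    "image_features": {"train": ["features/textvqa_train"]},
}
with_stvqa = add_training_data(
    textvqa_cfg, "imdb/stvqa_train.npy", "features/stvqa_train"
)
print(with_stvqa["imdb_files"]["train"])
```

Appending the imdb and feature entries in a single call keeps the two lists the same length, which is the property to check by hand when editing the YAML file directly.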

Training and Evaluation

Please follow the M4C README for the training and evaluation of the M4C model on each dataset.

Questions and Answers from emails

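Question: Feature extraction. Is the code for extracting each kind of feature in the paper open-sourced? I would like to extract features myself so that I can use them on other data.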

Answer: Several kinds of features are involved, each with its own extraction code. The ones I used are listed below:

  1. OCR Faster R-CNN features: to extract features from the OCR bounding boxes, you need to modify the Mask R-CNN detection framework by replacing the RPN layer with the hardcoded bounding boxes. I used the code in this repo, which should be used together with the feature extraction script.
  2. Object Faster R-CNN features: to extract features from the object bounding boxes, you do not need to modify the detection framework. Extract them directly with this repo and the corresponding extraction script.
  3. OCR bounding boxes: to get the OCR detection boxes, we use this repo, and the model we used is MLT 2017.
  4. OCR recognition results and Recog-CNN features: these are computed from the OCR bounding boxes. The text recognition code is not mine and is not open-sourced yet, so I cannot share it at present.
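
The idea of replacing the RPN with hardcoded bounding boxes can be illustrated with a small, framework-free sketch: the boxes come from the OCR detector, and features are pooled from the backbone feature map inside each box. This is not the code from the repos mentioned above. The function `pool_box_features`, the `stride` parameter, and plain average pooling are simplifying assumptions standing in for the detector's own RoI pooling and box head.

```python
def pool_box_features(feature_map, boxes, stride=1):
    """Average-pool one feature vector per given box.

    feature_map: H x W grid (list of rows) of C-dimensional feature lists.
    boxes: (x1, y1, x2, y2) in image pixels, supplied by the OCR detector
           instead of being proposed by an RPN.
    stride: image pixels per feature-map cell (assumed value).
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = []
    for x1, y1, x2, y2 in boxes:
        # Map pixel coordinates to feature-map cells, clamped to the grid.
        c1 = min(max(int(x1 // stride), 0), width - 1)
        r1 = min(max(int(y1 // stride), 0), height - 1)
        c2 = min(max(int((x2 - 1) // stride), c1), width - 1)
        r2 = min(max(int((y2 - 1) // stride), r1), height - 1)
        cells = [
            feature_map[r][c]
            for r in range(r1, r2 + 1)
            for c in range(c1, c2 + 1)
        ]
        # Average each channel over the cells covered by the box.
        pooled.append(
            [sum(cell[k] for cell in cells) / len(cells) for k in range(channels)]
        )
    return pooled


# Toy 2x2 feature map with 2 channels, and two fixed boxes.
toy_map = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[5.0, 2.0], [7.0, 2.0]],
]
ocr_boxes = [(0, 0, 2, 2), (1, 0, 2, 1)]
features = pool_box_features(toy_map, ocr_boxes, stride=1)
print(features)
```

The same function also covers the object-feature case: there the boxes would come from the detector's own proposals rather than from an external OCR system, which is why no framework change is needed for it.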