IterVM: Iterative Vision Modeling Module for Scene Text Recognition

The official code of IterNet.

We propose IterVM, an iterative approach for visual feature extraction which can significantly improve scene text recognition accuracy. IterVM repeatedly uses the high-level visual feature extracted at the previous iteration to enhance the multi-level features extracted at the subsequent iteration.
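The feedback loop described above can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: `toy_stage`, `fuse`, the three-stage backbone, and the choice of element-wise addition as the fusion operator are all hypothetical stand-ins, chosen only to show the data flow in which the high-level feature from one iteration is fed into every feature level of the next.

```python
from typing import List, Optional

Feature = List[float]


def toy_stage(x: Feature, scale: float) -> Feature:
    """Hypothetical stand-in for one backbone stage (a real model would use conv blocks)."""
    return [scale * v for v in x]


def fuse(level_feat: Feature, high_level: Optional[Feature]) -> Feature:
    """Assumed fusion operator: element-wise addition; a no-op on the first iteration."""
    if high_level is None:
        return level_feat
    return [a + b for a, b in zip(level_feat, high_level)]


def iterative_vision_model(image: Feature, num_iters: int = 3) -> List[Feature]:
    """Run the toy backbone num_iters times, feeding back the previous high-level feature."""
    stage_scales = [0.5, 0.5, 0.5]  # three toy stages, i.e. three feature levels
    high_level: Optional[Feature] = None
    outputs: List[Feature] = []
    for _ in range(num_iters):
        x = image
        for scale in stage_scales:
            # Each level's feature is enhanced with the previous iteration's high-level feature.
            x = fuse(toy_stage(x, scale), high_level)
        high_level = x  # becomes the feedback signal for the next iteration
        outputs.append(high_level)
    return outputs


if __name__ == "__main__":
    for t, feat in enumerate(iterative_vision_model([8.0, 16.0], num_iters=3), start=1):
        print(f"iteration {t}: {feat}")
```

The first iteration runs without feedback, so it behaves like an ordinary single-pass backbone; later iterations differ only in that every level also receives the previous high-level feature.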

Runtime Environment

pip install -r requirements.txt

Note: fastai==1.0.60 is required.

Datasets

Training datasets

1. [MJSynth](http://www.robots.ox.ac.uk/~vgg/data/text/) (MJ):
    - Use `tools/create_lmdb_dataset.py` to convert images into an LMDB dataset
    - [LMDB dataset BaiduNetdisk(passwd:n23k)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
2. [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST):
    - Use `tools/crop_by_word_bb.py` to crop images from the original [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) dataset, then convert the cropped images into an LMDB dataset with `tools/create_lmdb_dataset.py`
    - [LMDB dataset BaiduNetdisk(passwd:n23k)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
3. [WikiText103](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip), which is only used for pre-training language models:
    - Use `notebooks/prepare_wikitext103.ipynb` to convert text into CSV format.
    - [CSV dataset BaiduNetdisk(passwd:dk01)](https://pan.baidu.com/s/1yabtnPYDKqhBb_Ie9PGFXA)

Evaluation datasets

- The evaluation datasets, in LMDB format, can be downloaded from [BaiduNetdisk(passwd:1dbv)](https://pan.baidu.com/s/1RUg3Akwp7n8kZYJ55rU5LQ) or [GoogleDrive](https://drive.google.com/file/d/1dTI0ipu14Q1uuK4s4z32DqbqF3dJPdkk/view?usp=sharing).
    1. ICDAR 2013 (IC13)
    2. ICDAR 2015 (IC15)
    3. IIIT5K Words (IIIT)
    4. Street View Text (SVT)
    5. Street View Text-Perspective (SVTP)
    6. CUTE80 (CUTE)

The structure of the `data` directory

```
data
├── charset_36.txt
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC15_1811
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   └── ST
├── WikiText-103.csv
└── WikiText-103_eval_d1.csv
```
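Before training, it can help to confirm that the local `data` directory matches this layout. The following is a minimal sketch using only the standard library; the helper name `missing_data_paths` is hypothetical and not part of this repository, and the check only tests that each entry exists, not that the LMDB or CSV contents are valid.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the directory tree above.
EXPECTED_PATHS = [
    "charset_36.txt",
    "evaluation/CUTE80",
    "evaluation/IC13_857",
    "evaluation/IC15_1811",
    "evaluation/IIIT5k_3000",
    "evaluation/SVT",
    "evaluation/SVTP",
    "training/MJ/MJ_test",
    "training/MJ/MJ_train",
    "training/MJ/MJ_valid",
    "training/ST",
    "WikiText-103.csv",
    "WikiText-103_eval_d1.csv",
]


def missing_data_paths(data_root: str = "data") -> List[str]:
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_data_paths("data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("data directory matches the expected layout.")
```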

Pretrained Models

Get the pretrained models from GoogleDrive. The accuracy (%) of the pretrained models is summarized as follows:

| Model | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | AVG |
|---|---|---|---|---|---|---|---|
| IterNet | 97.9 | 95.1 | 96.9 | 87.7 | 90.9 | 91.3 | 93.8 |
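The AVG column does not equal the plain mean of the six scores (93.3), but it is consistent with an average weighted by test-set size. The sketch below checks this arithmetic. The sizes for IC13, IC15, and IIIT come from the directory names `IC13_857`, `IC15_1811`, and `IIIT5k_3000`; the sizes used for SVT (647), SVTP (645), and CUTE (288) are the commonly reported test-set sizes and are an assumption here, as is the weighting scheme itself.

```python
from typing import Dict

# Accuracy (%) per benchmark, copied from the table above.
ACCURACY: Dict[str, float] = {
    "IC13": 97.9, "SVT": 95.1, "IIIT": 96.9,
    "IC15": 87.7, "SVTP": 90.9, "CUTE": 91.3,
}

# Number of test images per benchmark.
# IC13, IC15, IIIT: from the data directory names.
# SVT, SVTP, CUTE: commonly reported sizes (assumption, not stated in this README).
NUM_SAMPLES: Dict[str, int] = {
    "IC13": 857, "SVT": 647, "IIIT": 3000,
    "IC15": 1811, "SVTP": 645, "CUTE": 288,
}


def simple_mean(acc: Dict[str, float]) -> float:
    """Unweighted mean over benchmarks."""
    return sum(acc.values()) / len(acc)


def weighted_mean(acc: Dict[str, float], n: Dict[str, int]) -> float:
    """Mean over benchmarks, weighted by the number of test images."""
    total = sum(n[name] for name in acc)
    return sum(acc[name] * n[name] for name in acc) / total


if __name__ == "__main__":
    print(f"simple mean:   {simple_mean(ACCURACY):.1f}")
    print(f"weighted mean: {weighted_mean(ACCURACY, NUM_SAMPLES):.1f}")
```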

Training

Pre-train vision model

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --config=configs/pretrain_vm.yaml

Pre-train language model

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_language_model.yaml

Train IterNet
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --config=configs/train_iternet.yaml
```
Note:
- You can set the checkpoint paths for the vision model (vm) and the language model separately to load specific pretrained models, or set them to None to train from scratch.

Evaluation

CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_iternet.yaml --phase test --image_only

Additional flags:

- `--checkpoint /path/to/checkpoint` set the path of the model to evaluate
- `--test_root /path/to/dataset` set the path of the evaluation dataset
- `--model_eval [alignment|vision]` which sub-model to evaluate
- `--image_only` disable dumping visualization of attention masks
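To evaluate each benchmark separately, the flags above can be combined into one command per dataset. The sketch below only builds and prints the commands; it does not run them. The helper `build_eval_command` is hypothetical, the checkpoint path is a placeholder, and it is an assumption that `--test_root` accepts a single dataset directory under `data/evaluation`.

```python
import shlex
from typing import Dict, List, Optional, Tuple

# Directory names under data/evaluation, as listed in the data layout.
EVAL_SETS = ["IC13_857", "SVT", "IIIT5k_3000", "IC15_1811", "SVTP", "CUTE80"]


def build_eval_command(
    test_root: str,
    checkpoint: Optional[str] = None,
    model_eval: Optional[str] = None,
    config: str = "configs/train_iternet.yaml",
    gpu: str = "0",
) -> Tuple[Dict[str, str], List[str]]:
    """Return (environment variables, argv) for one evaluation run."""
    env = {"CUDA_VISIBLE_DEVICES": gpu}
    argv = [
        "python", "main.py", f"--config={config}",
        "--phase", "test", "--image_only",
        "--test_root", test_root,
    ]
    if model_eval is not None:
        argv += ["--model_eval", model_eval]  # "alignment" or "vision"
    if checkpoint is not None:
        argv += ["--checkpoint", checkpoint]
    return env, argv


if __name__ == "__main__":
    for name in EVAL_SETS:
        env, argv = build_eval_command(
            f"data/evaluation/{name}",
            checkpoint="/path/to/checkpoint",  # placeholder path
        )
        prefix = " ".join(f"{key}={value}" for key, value in env.items())
        print(prefix, shlex.join(argv))
```

To actually launch a run, the returned `argv` and `env` could be passed to `subprocess.run` with the environment merged into `os.environ`.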

Run Demo

python demo.py --config=configs/train_iternet.yaml --input=figures/demo

Additional flags:

- `--config /path/to/config` set the path of the configuration file
- `--input /path/to/image-directory` set the path of the image directory or a wildcard path, e.g., `--input='figs/test/*.png'`
- `--checkpoint /path/to/checkpoint` set the path of the trained model
- `--cuda [-1|0|1|2|3...]` set the CUDA device id; the default is -1, which stands for CPU
- `--model_eval [alignment|vision]` which sub-model to use
- `--image_only` disable dumping visualization of attention masks

Citation

If you find our method useful for your research, please cite:

@article{chu2022itervm,
  title={IterVM: Iterative Vision Modeling Module for Scene Text Recognition},
  author={Chu, Xiaojie and Wang, Yongtao},
  journal={26th International Conference on Pattern Recognition (ICPR)},
  year={2022}
}

License

This project is free for academic research purposes only; commercial use requires authorization. For commercial licensing, please contact wyt@pku.edu.cn.

Acknowledgements

This project is based on ABINet. Thanks to its authors for their great work.
