IterVM: Iterative Vision Modeling Module for Scene Text Recognition

The official code of IterNet.

We propose IterVM, an iterative approach for visual feature extraction which can significantly improve scene text recognition accuracy. IterVM repeatedly uses the high-level visual feature extracted at the previous iteration to enhance the multi-level features extracted at the subsequent iteration.
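The feedback loop described above can be illustrated with a minimal, framework-free sketch. It is a toy illustration, not the repository's implementation: `iterative_features`, `stages`, and `fuse` are hypothetical names, features are plain lists of floats, and every level is assumed to share one shape so that fusion is a simple element-wise operation.

```python
from typing import Callable, List, Optional, Sequence

Feature = List[float]


def iterative_features(
    image: Feature,
    stages: Sequence[Callable[[Feature], Feature]],
    fuse: Callable[[Feature, Feature], Feature],
    num_iters: int,
) -> Feature:
    """Run the backbone stages `num_iters` times.

    From the second pass onward, the high-level feature produced by the
    previous pass is fused into the feature emitted by every stage.
    """
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")

    high_level: Optional[Feature] = None
    for _ in range(num_iters):
        x = image
        for stage in stages:
            x = stage(x)  # feature at this level
            if high_level is not None:
                # Enhance this level with the previous pass's top-level feature.
                x = fuse(x, high_level)
        high_level = x  # top-level feature of this pass
    return high_level


def add(a: Feature, b: Feature) -> Feature:
    """Toy fusion: element-wise sum (assumes equal lengths)."""
    return [u + v for u, v in zip(a, b)]


if __name__ == "__main__":
    toy_stages = [
        lambda x: [v + 1.0 for v in x],
        lambda x: [v * 2.0 for v in x],
    ]
    print(iterative_features([1.0], toy_stages, add, num_iters=2))
```

In a real network the levels would differ in resolution and channel count, so the fusion step would need resizing or projection; the sketch only shows the control flow.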

[Figure: framework]

Runtime Environment

pip install -r requirements.txt

Note: fastai==1.0.60 is required.
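Because the pinned version matters, it can help to verify the environment before launching a long training run. The following is a small sketch using only the standard library; `check_pinned_version` is a hypothetical helper, not part of this repository.

```python
from importlib import metadata


def check_pinned_version(package, required):
    """Return (ok, message) describing whether `package` is installed at `required`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False, f"{package} is not installed (need {required})"
    if installed != required:
        return False, f"{package} {installed} found, but {required} is required"
    return True, f"{package} {installed} OK"


if __name__ == "__main__":
    # The required version comes from the note above.
    ok, message = check_pinned_version("fastai", "1.0.60")
    print(message)
```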

Datasets

Training datasets

1. [MJSynth](http://www.robots.ox.ac.uk/~vgg/data/text/) (MJ):
   - Use `tools/create_lmdb_dataset.py` to convert images into an LMDB dataset
   - [LMDB dataset BaiduNetdisk(passwd:n23k)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
2. [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST):
   - Use `tools/crop_by_word_bb.py` to crop images from the original [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) dataset, then convert the cropped images into an LMDB dataset with `tools/create_lmdb_dataset.py`
   - [LMDB dataset BaiduNetdisk(passwd:n23k)](https://pan.baidu.com/s/1mgnTiyoR8f6Cm655rFI4HQ)
3. [WikiText103](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip), which is only used for pre-training language models:
   - Use `notebooks/prepare_wikitext103.ipynb` to convert text into CSV format.
   - [CSV dataset BaiduNetdisk(passwd:dk01)](https://pan.baidu.com/s/1yabtnPYDKqhBb_Ie9PGFXA)

Evaluation datasets

- LMDB versions of the evaluation datasets can be downloaded from [BaiduNetdisk(passwd:1dbv)](https://pan.baidu.com/s/1RUg3Akwp7n8kZYJ55rU5LQ) or [GoogleDrive](https://drive.google.com/file/d/1dTI0ipu14Q1uuK4s4z32DqbqF3dJPdkk/view?usp=sharing).
  1. ICDAR 2013 (IC13)
  2. ICDAR 2015 (IC15)
  3. IIIT5K Words (IIIT)
  4. Street View Text (SVT)
  5. Street View Text-Perspective (SVTP)
  6. CUTE80 (CUTE)

The structure of the `data` directory

- The structure of the `data` directory is
```
data
├── charset_36.txt
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC15_1811
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   └── ST
├── WikiText-103.csv
└── WikiText-103_eval_d1.csv
```
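A quick way to confirm that the `data` directory matches the layout listed above is to check each expected entry. This is a minimal sketch; `find_missing_entries` is a hypothetical helper, and the entry list is copied from the listing rather than read from the project's configs.

```python
from pathlib import Path

# Leaf entries taken from the `data` directory listing.
EXPECTED_ENTRIES = [
    "charset_36.txt",
    "evaluation/CUTE80",
    "evaluation/IC13_857",
    "evaluation/IC15_1811",
    "evaluation/IIIT5k_3000",
    "evaluation/SVT",
    "evaluation/SVTP",
    "training/MJ/MJ_test",
    "training/MJ/MJ_train",
    "training/MJ/MJ_valid",
    "training/ST",
    "WikiText-103.csv",
    "WikiText-103_eval_d1.csv",
]


def find_missing_entries(data_root):
    """Return the expected entries that do not exist under `data_root`."""
    root = Path(data_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    missing = find_missing_entries("data")
    if missing:
        print("Missing entries:")
        for entry in missing:
            print(f"  data/{entry}")
    else:
        print("data directory looks complete")
```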

Pretrained Models

Get the pretrained models from GoogleDrive. The performance of the pretrained models is summarized as follows:

| Model | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | AVG |
|---|---|---|---|---|---|---|---|
| IterNet | 97.9 | 95.1 | 96.9 | 87.7 | 90.9 | 91.3 | 93.8 |
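The AVG column appears to be a sample-weighted mean rather than a plain mean of the six scores. The sketch below checks this under stated assumptions: the sizes for IC13, IC15, and IIIT are read from the evaluation directory names (`IC13_857`, `IC15_1811`, `IIIT5k_3000`), while the sizes for SVT (647), SVTP (645), and CUTE (288) are the commonly used benchmark sizes and are not stated in this README.

```python
# Accuracies (%) from the table above.
ACCURACY = {
    "IC13": 97.9, "SVT": 95.1, "IIIT": 96.9,
    "IC15": 87.7, "SVTP": 90.9, "CUTE": 91.3,
}

# Test-set sizes. IC13, IC15, and IIIT come from the directory names;
# SVT, SVTP, and CUTE are assumptions (commonly reported sizes).
NUM_SAMPLES = {
    "IC13": 857, "SVT": 647, "IIIT": 3000,
    "IC15": 1811, "SVTP": 645, "CUTE": 288,
}


def weighted_average(accuracy, num_samples):
    """Mean accuracy weighted by the number of test images per dataset."""
    total = sum(num_samples[name] for name in accuracy)
    return sum(accuracy[name] * num_samples[name] for name in accuracy) / total


def simple_average(accuracy):
    """Unweighted mean over datasets."""
    return sum(accuracy.values()) / len(accuracy)


if __name__ == "__main__":
    print(f"weighted: {weighted_average(ACCURACY, NUM_SAMPLES):.1f}")
    print(f"simple:   {simple_average(ACCURACY):.1f}")
```

Under these assumptions the weighted mean rounds to 93.8, matching the table, while the unweighted mean rounds to 93.3.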

Training

  1. Pre-train vision model
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --config=configs/pretrain_vm.yaml
  2. Pre-train language model
    CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_language_model.yaml
  3. Train IterNet
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python main.py --config=configs/train_iternet.yaml

    Note:

    • You can set the checkpoint paths for the vision model (vm) and the language model separately to load specific pretrained models, or set them to None to train from scratch.
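The three stages above can also be driven from one script. The following is a sketch under assumptions: `build_commands` and `run_all` are hypothetical helpers, the GPU lists and config paths are copied from the commands above, and the script does not edit the checkpoint paths in `configs/train_iternet.yaml`, which still have to point at the pre-trained models before the final stage.

```python
import os
import subprocess

# (CUDA_VISIBLE_DEVICES, config file) for each stage, in order.
STAGES = [
    ("0,1,2,3,4,5,6,7", "configs/pretrain_vm.yaml"),
    ("0,1,2,3", "configs/pretrain_language_model.yaml"),
    ("0,1,2,3,4,5,6,7", "configs/train_iternet.yaml"),
]


def build_commands(stages=STAGES):
    """Return a list of (gpu_list, argv) pairs, one per training stage."""
    return [
        (gpus, ["python", "main.py", f"--config={config}"])
        for gpus, config in stages
    ]


def run_all(dry_run=True):
    """Print each stage's command; execute it only when dry_run is False."""
    for gpus, argv in build_commands():
        print(f"CUDA_VISIBLE_DEVICES={gpus} {' '.join(argv)}")
        if not dry_run:
            env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
            # check=True stops the pipeline if a stage fails.
            subprocess.run(argv, env=env, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Running with `dry_run=True` only prints the commands, which is a cheap way to confirm the stage order before committing GPUs to it.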

Evaluation

CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_iternet.yaml --phase test --image_only

Additional flags:

Run Demo

python demo.py --config=configs/train_iternet.yaml --input=figures/demo

Additional flags:

Citation

If you find our method useful for your research, please cite:

@article{chu2022itervm,
  title={IterVM: Iterative Vision Modeling Module for Scene Text Recognition},
  author={Chu, Xiaojie and Wang, Yongtao},
  journal={26th International Conference on Pattern Recognition (ICPR)},
  year={2022}
}

License

This project is free for academic research purposes only; commercial use requires authorization. For commercial licensing, please contact wyt@pku.edu.cn.

Acknowledgements

This project is based on ABINet. Thanks to its authors for their great work.