vedastr is an open-source scene text recognition toolbox based on PyTorch. It is designed to be flexible in order to support rapid implementation and evaluation of scene text recognition tasks.
Modular design: We decompose the scene text recognition framework into separate components, so a customized framework can be built by combining different modules.
Flexibility: The components within a module can be swapped easily.
Module expansibility: It is easy to integrate a new module into the vedastr project.
Support of multiple frameworks: The toolbox supports several popular scene text recognition frameworks, e.g., CRNN, TPS-ResNet-BiLSTM-Attention, Transformer, etc.
Good performance: We re-implement the best model in deep-text-recognition-benchmark and obtain better average accuracy. We also implement a simple baseline (ResNet-FC), which reaches an average accuracy of 84.24 in the benchmark table.
This project is released under the Apache 2.0 license.
Note:
MODEL | CASE SENSITIVE | IIIT5k_3000 | SVT | IC03_867 | IC13_1015 | IC15_2077 | SVTP | CUTE80 | AVERAGE |
---|---|---|---|---|---|---|---|---|---|
ResNet-CTC | False | 87.97 | 84.54 | 90.54 | 88.28 | 67.99 | 72.71 | 77.08 | 81.58 |
ResNet-FC | False | 88.80 | 88.41 | 92.85 | 90.34 | 72.32 | 79.38 | 76.74 | 84.24 |
TPS-ResNet-BiLSTM-Attention | False | 90.93 | 88.72 | 93.89 | 92.12 | 76.41 | 80.31 | 79.51 | 86.49 |
Small-SATRN | False | 91.97 | 88.10 | 94.81 | 93.50 | 75.64 | 83.88 | 80.90 | 87.19 |
TPS : Spatial transformer network
Small-SATRN: from the paper On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention. The training phase is case sensitive, while the testing phase is case insensitive.
AVERAGE : Average accuracy over all test datasets
CASE SENSITIVE : If true, the output is case sensitive and contains common characters. If false, the output is not case sensitive and contains only numbers and letters.
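The CASE SENSITIVE flag determines how predictions are compared with labels. The following is a minimal sketch of such a comparison, not vedastr's actual metric code: the helper names `normalize` and `accuracy` are hypothetical, and the case-insensitive character set is assumed to be digits plus lowercase ASCII letters.

```python
import string

# Assumption: the case-insensitive character set is digits plus lowercase letters.
ALPHANUMERIC = set(string.digits + string.ascii_lowercase)


def normalize(text):
    """Lowercase the text and drop every character that is not a digit or letter."""
    return "".join(ch for ch in text.lower() if ch in ALPHANUMERIC)


def accuracy(predictions, labels, case_sensitive=False):
    """Return word-level accuracy in percent (a word counts only on an exact match)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    if not case_sensitive:
        predictions = [normalize(p) for p in predictions]
        labels = [normalize(t) for t in labels]
    correct = sum(p == t for p, t in zip(predictions, labels))
    return 100.0 * correct / len(labels)


preds = ["Hello", "W0RLD!", "text"]
truth = ["hello", "w0rld", "test"]
print(round(accuracy(preds, truth), 2))                       # case insensitive
print(round(accuracy(preds, truth, case_sensitive=True), 2))  # case sensitive
```

In this toy example, two of the three predictions match once case and punctuation are ignored, while none match under the case-sensitive comparison.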
We have tested the following OS and software versions:
conda create -n vedastr python=3.6 -y
conda activate vedastr
conda install pytorch torchvision -c pytorch
git clone https://github.com/Media-Smart/vedastr.git
cd vedastr
vedastr_root=${PWD}
pip install -r requirements.txt
Download the LMDB data from deep-text-recognition-benchmark, which contains the training, validation and evaluation data. Note: we use the ST dataset released by ASTER.
Create the data directory as follows:
cd ${vedastr_root}
mkdir ${vedastr_root}/data
data
└── data_lmdb_release
    ├── evaluation
    ├── training
    │   ├── MJ
    │   │   ├── MJ_test
    │   │   ├── MJ_train
    │   │   └── MJ_valid
    │   └── ST
    └── validation
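Before training, it can help to confirm that the layout shown above is in place. The script below is a small sketch rather than part of vedastr: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the paths are read off the directory tree, assuming `evaluation`, `training` and `validation` all sit inside `data_lmdb_release`.

```python
from pathlib import Path

# Sub-directories read off the directory tree, relative to the data directory.
EXPECTED_DIRS = [
    "data_lmdb_release/evaluation",
    "data_lmdb_release/training/MJ/MJ_test",
    "data_lmdb_release/training/MJ/MJ_train",
    "data_lmdb_release/training/MJ/MJ_valid",
    "data_lmdb_release/training/ST",
    "data_lmdb_release/validation",
]


def missing_dirs(data_root):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from ${vedastr_root}, so that "data" resolves to ${vedastr_root}/data.
    missing = missing_dirs("data")
    if missing:
        print("Missing directories:")
        for d in missing:
            print("  " + d)
    else:
        print("Data layout looks complete.")
```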
Modify the configuration files in configs/ according to your needs (e.g. configs/tps_resnet_bilstm_attn.py).
# train using GPUs with gpu_id 0, 1, 2, 3
python tools/train.py configs/tps_resnet_bilstm_attn.py "0, 1, 2, 3"
By default, snapshots and logs are written to ${vedastr_root}/workdir/name_of_config_file (you can specify workdir in the config file).
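To locate the output of a run, the default path can be derived from the config file name. This is an illustrative sketch only: `default_workdir` is a hypothetical helper, and it assumes that name_of_config_file means the file name without its `.py` extension.

```python
from pathlib import Path


def default_workdir(vedastr_root, config_path):
    """Return <vedastr_root>/workdir/<config file name without extension>."""
    # Assumption: "name_of_config_file" is the config file name without ".py".
    return Path(vedastr_root) / "workdir" / Path(config_path).stem


print(default_workdir("/home/user/vedastr", "configs/tps_resnet_bilstm_attn.py"))
```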
Modify the configuration as needed (e.g. configs/tps_resnet_bilstm_attn.py).
# test using GPUs with gpu_id 0, 1
./tools/dist_test.sh configs/tps_resnet_bilstm_attn.py path/to/checkpoint.pth "0, 1"
# inference using GPUs with gpu_id 0
python tools/inference.py configs/tps_resnet_bilstm_attn.py checkpoint_path img_path "0"
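The train, test and inference commands share one pattern: a config file, optional checkpoint and image paths, and a quoted GPU id string. When runs are launched from a script, that pattern can be captured as below. The helper names are hypothetical; only the script paths and argument order are taken from the commands shown in this README.

```python
import shlex


def gpu_arg(gpu_ids):
    """Format GPU ids the way the README commands pass them, e.g. "0, 1"."""
    return ", ".join(str(i) for i in gpu_ids)


def train_command(config, gpu_ids):
    """Argument list for tools/train.py."""
    return ["python", "tools/train.py", config, gpu_arg(gpu_ids)]


def eval_command(config, checkpoint, gpu_ids):
    """Argument list for tools/dist_test.sh."""
    return ["./tools/dist_test.sh", config, checkpoint, gpu_arg(gpu_ids)]


def inference_command(config, checkpoint, image, gpu_ids):
    """Argument list for tools/inference.py."""
    return ["python", "tools/inference.py", config, checkpoint, image,
            gpu_arg(gpu_ids)]


cmd = eval_command("configs/tps_resnet_bilstm_attn.py",
                   "path/to/checkpoint.pth", [0, 1])
# Print a shell-safe version of the command for logging.
print(" ".join(shlex.quote(part) for part in cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from ${vedastr_root}; passing a list avoids shell quoting problems with the space-separated GPU string.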
Install volksdep following the official instructions
Benchmark (optional)
# Benchmark model using GPU with gpu_id 0
CUDA_VISIBLE_DEVICES="0" python tools/benchmark.py configs/resnet_ctc.py checkpoint_path out_path --dummy_input_shape "3,32,100"
More available arguments are detailed in tools/benchmark.py.
The results for resnet_ctc are as follows (test device: Jetson AGX Xavier, CUDA 10.2):
framework | version | input shape | data type | throughput (FPS) | latency (ms) |
---|---|---|---|---|---|
PyTorch | 1.5.0 | (1, 1, 32, 100) | fp32 | 64 | 15.81 |
TensorRT | 7.1.0.16 | (1, 1, 32, 100) | fp32 | 109 | 9.66 |
PyTorch | 1.5.0 | (1, 1, 32, 100) | fp16 | 113 | 10.75 |
TensorRT | 7.1.0.16 | (1, 1, 32, 100) | fp16 | 308 | 3.55 |
TensorRT | 7.1.0.16 | (1, 1, 32, 100) | int8(entropy_2) | 449 | 2.38 |
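To compare the rows of the table above, the throughput of each configuration can be expressed relative to the PyTorch fp32 baseline. The snippet below is plain arithmetic on the published numbers; `RESULTS` and `speedups` are names chosen for this illustration.

```python
# Rows copied from the benchmark table: (framework, data type, FPS, latency in ms).
RESULTS = [
    ("PyTorch", "fp32", 64, 15.81),
    ("TensorRT", "fp32", 109, 9.66),
    ("PyTorch", "fp16", 113, 10.75),
    ("TensorRT", "fp16", 308, 3.55),
    ("TensorRT", "int8(entropy_2)", 449, 2.38),
]


def speedups(results, baseline=("PyTorch", "fp32")):
    """Return the throughput of every row divided by the baseline throughput."""
    base_fps = next(fps for fw, dt, fps, _ in results if (fw, dt) == baseline)
    return {(fw, dt): fps / base_fps for fw, dt, fps, _ in results}


for (framework, dtype), factor in speedups(RESULTS).items():
    print(f"{framework:8s} {dtype:16s} {factor:.1f}x")
```

On these numbers, TensorRT fp16 delivers roughly 4.8 times the PyTorch fp32 throughput, and TensorRT int8 roughly 7 times.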
# export model to onnx using GPU with gpu_id 0
CUDA_VISIBLE_DEVICES="0" python tools/torch2onnx.py configs/resnet_ctc.py checkpoint_path --dummy_input_shape "3,32,100" --dynamic_shape
More available arguments are detailed in tools/torch2onnx.py.
Inference SDK
You can refer to FlexInfer for details.
If you use this toolbox or benchmark in your research, please cite this project.
@misc{2020vedastr,
title = {vedastr: A Toolbox for Scene Text Recognition},
author = {Sun, Jun and Cai, Hongxiang and Xiong, Yichao},
url = {https://github.com/Media-Smart/vedastr},
year = {2020}
}
This repository is currently maintained by Jun Sun (@ChaseMonsterAway), Hongxiang Cai (@hxcai), and Yichao Xiong (@mileistone).