A real-time scene text recognition algorithm. The system can recognize text against unconstrained backgrounds.
The algorithm is based on several papers and is implemented in C/C++.
Environment and dependencies
Put the opencv directory into C:\tools (choco install opencv), or edit CMakeLists.txt and change WIN_OPENCV_CONFIG_PATH to where you have it.
cd Scene-text-recognition
mkdir build-win
cd build-win
cmake .. -G "Visual Studio 15 2017 Win64"
cmake --build . --config Release
cd ..
dir | findstr scene
To run the scene_text_recognition.exe binary, use its wrapper script; for example:
.\scene_text_recognition.bat -i res\ICDAR2015_test\img_6.jpg
cd Scene-text-recognition
mkdir build-linux
cd build-linux
cmake ..
cmake --build .
cd ..
ls | grep scene
./scene_text_recognition -i res/ICDAR2015_test/img_6.jpg
The executable file scene_text_recognition must ultimately exist in the project root directory (i.e., next to classifier/, dictionary/, etc.).
./scene_text_recognition -v: take default webcam as input
./scene_text_recognition -v [video]: take a video as input
./scene_text_recognition -i [image]: take an image as input
./scene_text_recognition -i [path]: take a folder of images as input
./scene_text_recognition -l [image]: demonstrate the "Linear Time MSER" algorithm
./scene_text_recognition -t detection: train the text detection classifier
./scene_text_recognition -t ocr: train the text recognition (OCR) classifier
Put text data in res/pos, non-text data in res/neg.
Name the files 1.jpg, 2.jpg, 3.jpg, and so on.
Make sure the training folder exists.
Run ./scene_text_recognition -t detection
mkdir training
./scene_text_recognition -t detection
The text detection classifier will be found in the training folder.
Put the OCR training data in res/ocr_training_data/
Arrange it as [Font Name]/[Font Type]/[Category]/[Character.jpg], for instance Time_New_Roman/Bold/lower/a.jpg. You can refer to res/ocr_training_data.zip
Make sure the training folder exists, and put svm-train in the root folder (svm-train will be built by the system and should be found at build/).
Run ./scene_text_recognition -t ocr
mkdir training
mv svm-train scene-text-recognition/
./scene_text_recognition -t ocr
The text recognition (OCR) classifier will be found in the training folder.
The algorithm is based on a region detector called Extremal Region (ER), which is essentially a superset of the well-known MSER region detector. We use ERs to find text candidates, extracting them with the linear-time MSER algorithm. The pitfall of ERs is repeated detection, so we remove most of the duplicates with non-maximum suppression: we estimate the overlap between ERs from the component tree and calculate the stability of every ER, and within each group of overlapping ERs only the one with maximum stability is kept. After that we apply a two-stage Real-AdaBoost classifier to filter out non-text regions. We chose Mean-LBP as the feature because it is faster to compute than other features. The surviving ERs are then grouped to lift the result from character level to word level, which is more intuitive for humans. The next step is to apply OCR to the detected text. The chain code of the ER is used as the feature, and the classifier is trained with an SVM. We also add several post-processing steps, such as optimal-path selection and spell checking, to improve the recognition result.
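The non-maximum suppression step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the Region struct, the precomputed overlap group ids, and the convention that a larger stability value means a more stable region are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified ER record (the project's real structure is richer).
struct Region {
    std::size_t group;  // overlap group id, assumed precomputed from the component tree
    double stability;   // assumed convention: larger value = more stable
};

// Keep one region per overlap group: the one with maximum stability.
// Returns the indices of the surviving regions, ordered by group id.
std::vector<std::size_t> suppressNonMaximal(const std::vector<Region>& regions,
                                            std::size_t groupCount) {
    const std::size_t none = regions.size();  // sentinel: "no region seen yet"
    std::vector<std::size_t> best(groupCount, none);
    for (std::size_t i = 0; i < regions.size(); ++i) {
        const std::size_t g = regions[i].group;
        assert(g < groupCount);
        if (best[g] == none || regions[i].stability > regions[best[g]].stability) {
            best[g] = i;
        }
    }
    std::vector<std::size_t> kept;
    for (std::size_t idx : best) {
        if (idx != none) kept.push_back(idx);
    }
    return kept;
}
```

Calling suppressNonMaximal on five regions split into two groups returns two indices, one per group, each pointing at the most stable member of its group.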
For text classification, the training data contains 12,000 positive samples, mostly extracted from the ICDAR 2003 and ICDAR 2015 datasets. The negative samples are extracted from random images with a bootstrap process. For OCR classification, the training data consists purely of synthetic letters in 28 different fonts.
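One round of a bootstrap process for negative samples might look like the sketch below: run the current classifier over patches cut from text-free images and keep the ones it wrongly accepts, so they can be added to the negative set before retraining. The Patch type, the scoreText callback, and the threshold are placeholders for illustration; the project's real feature extraction and classifier interface are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical feature vector of one image patch.
using Patch = std::vector<float>;

// One bootstrap round: return the indices of background patches that the
// current classifier wrongly accepts as text (false positives).
std::vector<std::size_t> collectHardNegatives(
    const std::vector<Patch>& backgroundPatches,
    const std::function<double(const Patch&)>& scoreText,
    double acceptThreshold) {
    assert(scoreText);  // a scoring function must be supplied
    std::vector<std::size_t> hardNegatives;
    for (std::size_t i = 0; i < backgroundPatches.size(); ++i) {
        // Every patch here is known to be non-text, so any acceptance is an error.
        if (scoreText(backgroundPatches[i]) > acceptThreshold) {
            hardNegatives.push_back(i);
        }
    }
    return hardNegatives;
}
```

The returned patches would be appended to the negative set and the classifier retrained, repeating until few new false positives appear.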
The system detects text in real time (30 FPS) and recognizes text in nearly real time (8 to 15 FPS, depending on the amount of text) for a 640x480 image on an Intel Core i7 desktop computer. The algorithm's end-to-end text detection accuracy on the ICDAR 2015 dataset is roughly 70% with fine-tuning, and its end-to-end recognition accuracy is about 30%.
The green pixels are the so-called boundary pixels, which are pushed onto stacks. Each stack stands for a gray level, and pixels are pushed according to their gray level.
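The per-gray-level stacks can be sketched as a small container like the one below. It assumes 8-bit gray levels and pixels identified by an integer index, and it always hands back a pixel from the lowest non-empty gray level, which is how the linear-time MSER paper describes its boundary heap. The class name and interface are invented for this example and are not taken from the project's code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One stack per 8-bit gray level; boundary pixels are stored as pixel indices.
class BoundaryStacks {
public:
    // Push a boundary pixel onto the stack of its gray level.
    void push(int pixelIndex, std::uint8_t grayLevel) {
        const std::size_t level = grayLevel;
        stacks_[level].push_back(pixelIndex);
        if (level < lowest_) lowest_ = level;
        ++size_;
    }

    bool empty() const { return size_ == 0; }

    // Pop a pixel from the lowest non-empty gray level and report that level.
    int popLowest(std::uint8_t& grayLevel) {
        assert(!empty());
        // lowest_ never exceeds the lowest non-empty level, so this scan stops in range.
        while (stacks_[lowest_].empty()) ++lowest_;
        grayLevel = static_cast<std::uint8_t>(lowest_);
        const int pixelIndex = stacks_[lowest_].back();
        stacks_[lowest_].pop_back();
        --size_;
        return pixelIndex;
    }

private:
    std::array<std::vector<int>, 256> stacks_;
    std::size_t lowest_ = 256;  // lower bound on the lowest non-empty level
    std::size_t size_ = 0;      // total number of stored pixels
};
```

Pixels pushed at the same gray level come back in last-in, first-out order, and a pixel at a darker level is always returned before one at a brighter level.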