Arabic OCR
- An OCR system for the Arabic language that converts images of typed text into machine-encoded text.
- The system targets a simplified OCR problem: images that contain only Arabic characters (see the dataset link below for sample images).
Important Note
The system currently supports letters only (29 letters: ا-ى and لا); numbers and special symbols are not supported.
Setup
Install Python, then run this command:
pip install -r requirements.txt
Run
- Put the images in the src/test directory.
- Go to the src directory and run the following command:
python OCR.py
- An output folder will be created with:
  - A text folder containing one text file for each input image.
  - A running_time file containing the time taken to process each image.
Pipeline
Dataset
- Link to the dataset of images and their corresponding text: here.
- We used 1000 images to generate the character dataset used for training.
Examples
Line Segmentation
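As a rough illustration of how printed text lines can be separated, the sketch below uses a horizontal projection profile: it counts text pixels per row and cuts wherever a run of empty rows appears. This is a generic technique and not necessarily what OCR.py does; the function name `segment_lines`, the `min_height` parameter, and the assumption that the input is already binarized (text pixels equal to 1) are all hypothetical.

```python
import numpy as np


def segment_lines(binary, min_height=1):
    """Return (top, bottom) row ranges of text lines in a binary image.

    Assumes text pixels are 1 and background pixels are 0 (hypothetical
    convention, not taken from the project).
    """
    # Horizontal projection: number of text pixels in each row.
    profile = binary.sum(axis=1)
    has_ink = profile > 0

    lines = []
    start = None
    for row, ink in enumerate(has_ink):
        if ink and start is None:
            start = row  # first inked row of a new line
        elif not ink and start is not None:
            if row - start >= min_height:
                lines.append((start, row))  # bottom index is exclusive
            start = None
    # Close a line that runs to the bottom edge of the image.
    if start is not None and len(has_ink) - start >= min_height:
        lines.append((start, len(has_ink)))
    return lines


# Toy page with two blocks of "text" separated by blank rows.
page = np.zeros((10, 8), dtype=np.uint8)
page[1:3, 2:6] = 1
page[5:8, 1:7] = 1
lines = segment_lines(page)
```

Real Arabic text with diacritics usually needs extra handling, since dots and marks above or below a line can produce small separate runs; `min_height` is one simple way to filter those out.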
Word Segmentation
Character Segmentation
Testing
NOTE: Make sure you have a folder containing the ground-truth text files, with the same file names as the predicted text files, so the two can be compared.
From within the src folder, run:
python edit.py 'output/text' 'truth'
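This README does not show how edit.py scores the output. The script name suggests an edit-distance comparison, so the following is a minimal sketch of that idea under that assumption. The names `edit_distance` and `accuracy` are hypothetical and may not match the actual script.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def accuracy(predicted, truth):
    """Accuracy as 1 - distance / len(truth), clamped to [0, 1] (assumed metric)."""
    if not truth:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - edit_distance(predicted, truth) / len(truth))
```

Averaging `accuracy` over every predicted/truth file pair would give a single score comparable in spirit to the figure reported under Performance, although the exact formula used there is not stated.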
Performance
- Average accuracy: 95%.
- Average time per image: 16 seconds.
NOTE
We achieved these results using only the flattened image as the feature.
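To make "flattened image as the feature" concrete, here is a minimal sketch that resizes a segmented character image to a fixed size and unrolls it into a one-dimensional vector. The 25x25 target size, the nearest-neighbour resize, and the function name `flatten_feature` are assumptions for illustration; the project's actual preprocessing is not described here.

```python
import numpy as np


def flatten_feature(char_img, size=(25, 25)):
    """Resize a 2-D character image to `size` and flatten it to a 1-D vector.

    The target size is a hypothetical choice, not a value from the project.
    """
    h, w = char_img.shape
    # Nearest-neighbour resize: pick evenly spaced source rows and columns.
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    resized = char_img[np.ix_(rows, cols)]
    # Flatten row by row so every character yields a vector of equal length.
    return resized.astype(np.float32).ravel()


# Example: a 40x30 character crop becomes a 625-element feature vector.
feature = flatten_feature(np.ones((40, 30), dtype=np.uint8))
```

Fixing the size first matters because a classifier needs every feature vector to have the same length, regardless of how wide or tall each segmented character is.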
References
- An Efficient, Font Independent Word and Character Segmentation Algorithm for Printed Arabic Text.
- A Robust Line Segmentation Algorithm for Arabic Printed Text with Diacritics.
- Arabic Character Segmentation Using Projection Based Approach with Profile's Amplitude Filter.