This is the repository of my final project in computer science for the Ministry of Education.
Develop an algorithm that takes in an image file and returns the text found in the image as a string. Using the algorithm, build an easy-to-use app that lets users scan or upload an image of a document or handwritten text and "extract" the text from it.
The server running the algorithm is composed of several modules, each tasked with a different part of the objective.
Server
├── base_model.py
├── bounding_rects.py
├── config_tf.py
├── consts.py
├── evaluate_model.py
├── hough_rect.py
├── model_evaluator.py
├── noise_remover.h5
├── noise_remover.py
├── ocr.py
├── ocr_model.h5
├── ocr_model.py
├── preprocessing.py
├── requirements.txt
├── server.py
└── train_models.py
The Python server communicates over HTTP using the FastAPI library.
/find_page_points
- To find the region-of-interest of the image (usually the page). Takes in a JSON object with one key-value pair: the key is "b64image", and the value is the image encoded as a base64 string. Returns a JSON object with a list of four objects ("points"), each with an x and y position on the image.
/image_to_text
- To detect the text in an image. Takes in a JSON object that holds the image (key "b64image") as a base64 string. Optionally, a list of region-of-interest points (key "points") may also be provided, encoded as JSON objects with integer x and y components. If "points" isn't provided, the server will try to find them automatically (and if that fails, it processes the entire image).
/text_to_docx/{text}
- To put the text into a Microsoft Word (DOCX) document (used by the app).

When the server receives a base64 image, the image is first decoded and transformed into a 2D grayscale image represented by a NumPy array (see decode_image in server.py for possible errors and their appropriate responses).
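For example, here is a minimal client-side sketch of calling /image_to_text from Python, assuming the endpoint accepts a POST request with a JSON body; the host, port, file name, and point coordinates below are placeholders:

```python
import base64
import requests

# Read and base64-encode the image, as expected by the server.
with open("document.jpg", "rb") as f:
    b64image = base64.b64encode(f.read()).decode("ascii")

payload = {
    "b64image": b64image,
    # Optional region-of-interest corners; omit this key to let the server find them.
    "points": [{"x": 10, "y": 20}, {"x": 800, "y": 25},
               {"x": 790, "y": 1100}, {"x": 15, "y": 1090}],
}

# localhost:8000 is a placeholder - point this at your running server.
response = requests.post("http://localhost:8000/image_to_text", json=payload)
response.raise_for_status()
print(response.json())
```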
Afterwards, the corners of the page (ROI) are calculated. Lines are detected in the image using the Probabilistic Hough Line Transform algorithm, and then the points-of-intersection between the lines are calculated.
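The actual implementation lives in hough_rect.py; below is a minimal OpenCV sketch of the same idea (the Canny and Hough parameters are illustrative, not the repository's values):

```python
import cv2
import numpy as np

def detect_page_lines(gray: np.ndarray) -> np.ndarray:
    """Detect prominent straight line segments (candidate page edges)."""
    edges = cv2.Canny(gray, 50, 150)  # the Hough transform operates on an edge map
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                            threshold=100, minLineLength=100, maxLineGap=10)
    return lines if lines is not None else np.empty((0, 1, 4), dtype=np.int32)

def intersection(l1, l2):
    """Intersection point of the two infinite lines through the given segments
    (each given as x1, y1, x2, y2), or None if they are parallel."""
    x1, y1, x2, y2 = l1
    x3, y3, x4, y4 = l2
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < 1e-9:
        return None
    px = ((x1 * y2 - y1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * y4 - y3 * x4)) / denom
    py = ((x1 * y2 - y1 * x2) * (y3 - y4) - (y1 - y2) * (x3 * y4 - y3 * x4)) / denom
    return px, py
```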
Using the points, the original image is transformed to only include the area enclosed by these points, and some other preprocessing filters and transformations are applied. If no points are found, the whole image is preprocessed.
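The warp itself can be done with a perspective transform; here is a minimal sketch, assuming the four corners are already ordered top-left, top-right, bottom-right, bottom-left:

```python
import cv2
import numpy as np

def warp_to_roi(gray: np.ndarray, corners: np.ndarray) -> np.ndarray:
    """Map the quadrilateral enclosed by `corners` onto an axis-aligned rectangle."""
    corners = corners.astype(np.float32)
    tl, tr, br, bl = corners
    width = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    height = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]], dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(corners, dst)
    return cv2.warpPerspective(gray, matrix, (width, height))
```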
In the preprocessed image, the bounding rectangle of each individual character's contour is found. The rectangles are sorted into the correct order of the characters in the image, and spaces are detected between sequences of characters (words).
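bounding_rects.py holds the real logic; the following is a simplified sketch of the idea, assuming a single line of already-binarized text (the real code also has to order multiple lines):

```python
import cv2
import numpy as np

def character_boxes(binary: np.ndarray) -> list:
    """Bounding rectangles (x, y, w, h) of character-like contours, left to right."""
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    boxes = [b for b in boxes if b[2] * b[3] > 20]  # drop tiny specks
    return sorted(boxes, key=lambda b: b[0])

def word_breaks(boxes, gap_factor: float = 1.5) -> list:
    """Indices after which a space is inserted: gaps much wider than the
    median inter-character gap are treated as word breaks."""
    gaps = [boxes[i + 1][0] - (boxes[i][0] + boxes[i][2]) for i in range(len(boxes) - 1)]
    if not gaps:
        return []
    median_gap = max(float(np.median(gaps)), 1.0)
    return [i for i, g in enumerate(gaps) if g > gap_factor * median_gap]
```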
Each individual character is then cut out and placed into its own NumPy array, which is passed through the models. First, the character image is passed to a denoising autoencoder, which denoises and softens it. Then it is passed to the classifier model. The classifier is built using TensorFlow's Keras API and can be loaded from the HDF5 file. It is a CNN (Convolutional Neural Network) made up of many layers, and was trained on over 300,000 images from the EMNIST database (Extended Modified National Institute of Standards and Technology database, using the merged version). The model classifies an image of a character with 92.91% accuracy. It outputs only lowercase letters and digits, although the input may also be an uppercase character.
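A minimal inference sketch using the two bundled HDF5 files is shown below; the input size and label ordering are assumptions for illustration, and the real values are defined in consts.py:

```python
import numpy as np
from tensorflow.keras.models import load_model

denoiser = load_model("noise_remover.h5")   # denoising autoencoder
classifier = load_model("ocr_model.h5")     # CNN character classifier

# Assumed for illustration: 28x28 grayscale inputs, digits followed by lowercase letters.
LABELS = "0123456789abcdefghijklmnopqrstuvwxyz"

def classify_character(char_img: np.ndarray) -> str:
    """Pass one cropped character image through the denoiser, then the classifier."""
    x = char_img.astype("float32") / 255.0   # normalize pixel values to [0, 1]
    x = x.reshape(1, 28, 28, 1)              # batch of one, single channel
    x = denoiser.predict(x, verbose=0)       # denoise and soften the strokes
    probs = classifier.predict(x, verbose=0)[0]
    return LABELS[int(np.argmax(probs))]
```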
If you wish to train the models yourself, download the image files, change the train and validation paths in consts.py, and run train_models.py (be advised: the process may take over 24 hours if run on a CPU, and it will run much faster on a GPU).
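For orientation only, the general shape of such a training run with Keras might look like the sketch below; the directory layout, image size, and architecture are assumptions, not the repository's exact train_models.py:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder paths - the real ones are configured in consts.py.
TRAIN_DIR, VAL_DIR = "data/train", "data/validation"

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    TRAIN_DIR, color_mode="grayscale", image_size=(28, 28), batch_size=128)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    VAL_DIR, color_mode="grayscale", image_size=(28, 28), batch_size=128)

num_classes = len(train_ds.class_names)            # one class per character label
normalize = lambda image, label: (image / 255.0, label)
train_ds = train_ds.map(normalize)                 # scale pixels to [0, 1]
val_ds = val_ds.map(normalize)

# A small CNN in the spirit of the classifier described above, not its exact architecture.
model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
model.save("ocr_model.h5")
```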
After joining the characters output by the classifier, spellchecking is performed on the text to fix any remaining errors that occurred during classification. The output text is then returned to the client.
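The exact spellchecking approach isn't detailed above; as one possibility, here is a sketch using the third-party pyspellchecker package (pip install pyspellchecker):

```python
from spellchecker import SpellChecker

def correct_text(text: str) -> str:
    """Replace each misspelled word with its most likely correction."""
    spell = SpellChecker()
    corrected = []
    for word in text.split():
        # correction() may return None when no candidate is found; keep the word then.
        corrected.append(spell.correction(word) or word)
    return " ".join(corrected)

print(correct_text("tbe quick brwn fox"))  # typically -> "the quick brown fox"
```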
The classification process runs much faster on a GPU, and using one is recommended, since TensorFlow takes advantage of GPU acceleration.
To run the server locally (make sure you have Python 3.9 installed - TensorFlow 2.5 does not support newer Python versions):
git clone https://github.com/TomerGibor/Final-Project-OCR.git
cd ./Final-Project-OCR/Server
pip install -r requirements.txt
python server.py
The app is written with the Flutter framework using the Dart programming language. The app is named Editable, since you can take a picture of some text, extract the text from the image, and edit it digitally.
App/Editable
└── lib
├── helpers
│ ├── db_helper.dart
│ ├── file_helper.dart
│ └── http_helper.dart
├── main.dart
├── providers
│ ├── editables.dart
│ └── settings.dart
├── screens
│ ├── add_editable_screen.dart
│ ├── edit_editable_screen.dart
│ ├── home_screen.dart
│ ├── select_points_on_image_screen.dart
│ └── settings_screen.dart
└── widgets
├── app_drawer.dart
├── editable_item.dart
├── error_dialog.dart
└── image_input.dart
Use the + button to add an Editable. Select an image from device storage or take a picture with the camera and press the Submit button. Then, if the option is enabled in settings, you will be able to check whether the page corners are set properly, and correct them yourself if not. Press Confirm, and after a few seconds (or minutes, depending on how much text there is and the speed of your CPU or GPU) the Editable will be added to the list!
You can also customize the looks of the app in the settings!
On an Editable you can perform many operations: edit, copy, share, download as a Word (DOCX) document, and translate.
Make sure you have Flutter installed on your system before trying to run the following code:
git clone https://github.com/TomerGibor/Final-Project-OCR.git
cd ./Final-Project-OCR/App/Editable
flutter run
To use the app with the server, change the server URL in the http_helper.dart module to your own server's URL.