RFC00156: Filtering OCR line image data based on OCR confidence
Named Concepts
OCR Data: The text and metadata extracted from page images using Optical Character Recognition, stored as HTML files.
OCR Confidence: A numerical score estimating how reliably the OCR process recognized each extracted text segment.
Line Image: A subsection of a page image, corresponding to a line of text.
Non-word: A misspelled Tibetan word.
Non-bo word: A non-Tibetan word.
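To make the non-bo word definition concrete, here is a minimal sketch that counts non-Tibetan words using only the Python standard library. It assumes a non-bo word is any run of letters, digits, or underscores outside the Unicode Tibetan block (U+0F00 to U+0FFF). The function name `count_non_bo_words` is hypothetical, and the actual pipeline may rely on botok instead.

```python
import re

# Assumption: a "non-bo word" is a maximal run of word characters
# (letters, digits, underscore) that lie outside the Unicode Tibetan
# block U+0F00-U+0FFF. This is an illustrative rule, not botok's.
_NON_BO_WORD = re.compile(r"[^\W\u0F00-\u0FFF]+")


def count_non_bo_words(text: str) -> int:
    """Count runs of non-Tibetan word characters in an OCR transcript."""
    return len(_NON_BO_WORD.findall(text))
```

For a transcript that mixes Tibetan text with stray Latin letters or digits, each stray run is counted once. Detecting non-words (misspelled Tibetan) needs a Tibetan lexicon and is out of reach for a regular expression like this one.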
Summary
This RFC proposes building a training dataset of line images and their corresponding transcripts, recorded in CSV files. The input is a set of page images organized in per-work folders, together with the HTML file that describes each page image.
The output is split into three categories by OCR confidence: 50-75%, 76-90%, and 91-100%.
Each category is reported together with its non-word and non-bo word counts.
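The three confidence ranges can be sketched as a small lookup function. This is a hedged illustration rather than the project's implementation. It assumes confidence is expressed as a percentage, that fractional values between two ranges (for example 75.5) fall into the lower range, and that lines below 50% are left out of the output. The names `BANDS` and `confidence_band` are hypothetical.

```python
from typing import Optional

# Lower bound (percent) and label of each band, highest band first.
# Labels mirror the ranges in the Summary; boundary handling is an assumption.
BANDS = [
    (91.0, "91-100"),
    (76.0, "76-90"),
    (50.0, "50-75"),
]


def confidence_band(confidence: float) -> Optional[str]:
    """Return the band label for a confidence in percent, or None below 50."""
    if not 0.0 <= confidence <= 100.0:
        raise ValueError(f"confidence must be within 0-100, got {confidence}")
    for lower_bound, label in BANDS:
        if confidence >= lower_bound:
            return label
    return None  # Assumed: lines under 50% are excluded from the dataset.
```

Checking the highest band first keeps each comparison to a single lower bound, so no value between 50 and 100 can fall into a gap between ranges.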
Dependencies
BeautifulSoup4 for HTML parsing
Pillow for image processing
botok for Tibetan text filtering
Infrastructures
An S3 bucket or equivalent storage for the processed line images and metadata CSV files
Design Illustrations
Input:
Repo folder: a structured repo folder containing an HTML file for each page.
Work folder: a structured work folder containing the page images.
Output:
Work folder: a structured work folder with the line images stored in one folder per page ID.
Work CSV files: CSV files holding the line image transcripts, split into the three OCR confidence ranges.
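As an illustration of the CSV output, the sketch below groups line records by confidence range and writes one CSV file per range with the standard `csv` module. The column names in `FIELDS`, the `band` key on each record, the one-file-per-range layout, and the function name `write_band_csvs` are all assumptions made for this example, since the RFC does not fix a schema.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Hypothetical column layout; the RFC does not specify one.
FIELDS = [
    "line_image",
    "transcript",
    "ocr_confidence",
    "non_word_count",
    "non_bo_word_count",
]


def write_band_csvs(rows, out_dir):
    """Write one CSV per confidence band and return {band: path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the records by their precomputed "band" label.
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["band"]].append(row)

    written = {}
    for band, band_rows in grouped.items():
        path = out_dir / f"{band}.csv"
        # utf-8 keeps Tibetan transcripts intact; newline="" is what csv expects.
        with path.open("w", newline="", encoding="utf-8") as handle:
            # extrasaction="ignore" drops keys such as "band" not listed in FIELDS.
            writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(band_rows)
        written[band] = path
    return written
```

Keeping each range in its own file lets a training job pick a quality tier by filename, without filtering rows at load time.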
Justification
The proposed design was selected for its simplicity, effectiveness in improving data quality, and ease of integration with existing OCR processing workflows. Alternatives, such as more complex machine learning-based post-processing, were considered but deemed unnecessary for the current scope and requirements.
Testing
Unit Tests: To cover individual functions, particularly parsing and image processing.
Integration Tests: To ensure the system works as a whole, especially the flow from HTML parsing to line image cropping and CSV updating.
Implementation Steps
List all the steps involved during implementation.
[ ] OpenPecha/create_ocr_data#8
Estimated time: 2 hours
Actual time:
[ ] OpenPecha/create_ocr_data#9
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/create_ocr_data#10
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/create_ocr_data#11
Estimated time: 3 hours
Actual time:
[ ] OpenPecha/create_ocr_data#12
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/create_ocr_data#13
Estimated time: 1 hour
Actual time:
Reviewed By
@ta4tsering