gangagyatso4364 commented 3 months ago

RFC00156: Filtering OCR line image data based on OCR confidence

Named Concepts

OCR Data: The text and metadata extracted from images using Optical Character Recognition, stored as HTML files.
OCR Confidence: A numerical score estimating the accuracy of the OCR output for each extracted text segment.
Line Image: A subsection of a page image corresponding to a single line of text.
Non word: A misspelled Tibetan word.
Non bo word: A non-Tibetan word.
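
As a rough illustration of the "non bo word" concept, the sketch below counts runs of characters that fall outside the Tibetan Unicode block (U+0F00 to U+0FFF). It is a standard-library stand-in, not the botok-based filtering this RFC depends on. The function name `count_non_bo_words` is hypothetical, and treating every non-Tibetan run (including digits and Latin punctuation) as one word is an assumption. Detecting misspelled Tibetan words ("non word") needs a lexicon and is not covered here.

```python
import re

# Runs of characters that are neither Tibetan (U+0F00-U+0FFF) nor whitespace.
# Assumption: each such run counts as one "non bo word".
NON_TIBETAN_RUN = re.compile(r"[^\u0F00-\u0FFF\s]+")


def count_non_bo_words(text: str) -> int:
    """Count runs of non-Tibetan characters in a transcript line."""
    return len(NON_TIBETAN_RUN.findall(text))


if __name__ == "__main__":
    sample = "བཀྲ་ཤིས་ hello བདེ་ལེགས། 123"
    print(count_non_bo_words(sample))
```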

Summary

This RFC proposes creating training data that consists of line images and their corresponding transcripts in CSV files. The input is a set of page images organized in work-wise folders, together with the corresponding page descriptions in HTML files. The output will be categorized into three sections based on OCR confidence: 50-75%, 76-90%, and 91-100%, each with its non word and non bo word counts.
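
A minimal sketch of the confidence categorization, assuming the confidence is a percentage between 0 and 100. The function name `confidence_band` is hypothetical. The RFC does not say how fractional values between bands (for example 75.5) are handled, so assigning them to the next band up is an assumption, as is dropping lines below 50%.

```python
from typing import Optional


def confidence_band(confidence: float) -> Optional[str]:
    """Map an OCR confidence percentage to one of the three RFC bands.

    Returns None for values outside 50-100 (assumed to be excluded).
    """
    if confidence < 50 or confidence > 100:
        return None
    if confidence <= 75:
        return "50-75"
    # Assumption: fractional values above 75 fall into the next band.
    if confidence <= 90:
        return "76-90"
    return "91-100"


if __name__ == "__main__":
    for value in (42, 50, 75, 76, 90, 91, 100):
        print(value, confidence_band(value))
```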

Dependencies

BeautifulSoup4 for HTML parsing
Pillow for image processing
botok for Tibetan text filtering

Infrastructures

An S3 bucket or equivalent for storing processed line images and metadata CSV files

Design Illustrations

OCR input:

Repo folder: a structured repo folder with an HTML file for each page
Work folder: a structured work folder with page images

Output:

Work folder: a structured work folder with line images in page ID folders
Work CSV files: CSV files with the transcripts of the line images, categorized into three ranges of OCR confidence
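
The flow from a page HTML file to cropped line images could look roughly like the sketch below. It assumes the HTML follows hOCR conventions (`ocr_line` elements with a `bbox x0 y0 x1 y1` title, `ocrx_word` elements with an `x_wconf` score). The RFC does not specify the HTML layout, so the actual files may differ. The function names and the use of the mean word confidence as the line confidence are also assumptions.

```python
import re

from bs4 import BeautifulSoup
from PIL import Image

# Assumed hOCR-style title attributes, e.g. "bbox 10 20 110 50; x_wconf 80".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WCONF_RE = re.compile(r"x_wconf (\d+(?:\.\d+)?)")


def extract_lines(html: str):
    """Yield (bbox, text, mean word confidence) for each line element."""
    soup = BeautifulSoup(html, "html.parser")
    for line in soup.find_all(class_="ocr_line"):
        bbox_match = BBOX_RE.search(line.get("title", ""))
        if bbox_match is None:
            continue  # no coordinates, so the line cannot be cropped
        bbox = tuple(int(v) for v in bbox_match.groups())
        confidences = []
        for word in line.find_all(class_="ocrx_word"):
            conf_match = WCONF_RE.search(word.get("title", ""))
            if conf_match:
                confidences.append(float(conf_match.group(1)))
        if not confidences:
            continue  # no confidence data, so the line cannot be categorized
        text = " ".join(line.get_text().split())  # normalize whitespace
        yield bbox, text, sum(confidences) / len(confidences)


def crop_lines(page: Image.Image, html: str):
    """Return (line image, transcript, confidence) for each line on a page."""
    return [
        (page.crop(bbox), text, conf)
        for bbox, text, conf in extract_lines(html)
    ]
```

Each returned tuple carries what one CSV row would need: the line image to save under the page ID folder, its transcript, and the confidence used to pick the output category.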

Justification

The proposed design was selected for its simplicity, effectiveness in improving data quality, and ease of integration with existing OCR processing workflows. Alternatives, such as more complex machine learning-based post-processing, were considered but deemed unnecessary for the current scope and requirements.

Testing

Unit Tests: To cover individual functions, particularly parsing and image processing.
Integration Tests: To ensure the system works as a whole, especially the flow from HTML parsing to line image cropping and CSV updating.

Implementation Steps

List all the steps involved during implementation.

[ ] OpenPecha/create_ocr_data#8 Estimated time: 2 hours Actual time:
[ ] OpenPecha/create_ocr_data#9 Estimated time: 1 hour Actual time:
[ ] OpenPecha/create_ocr_data#10 Estimated time: 1 hour Actual time:
[ ] OpenPecha/create_ocr_data#11 Estimated time: 3 hours Actual time:
[ ] OpenPecha/create_ocr_data#12 Estimated time: 1 hour Actual time:
[ ] OpenPecha/create_ocr_data#13 Estimated time: 1 hour Actual time:

Reviewed By

@ta4tsering

ta4tsering commented 3 months ago

Please also include the percentage of inference text validity for each of those sections.

gangagyatso4364 commented 3 months ago

okay

ta4tsering commented 3 months ago

@kaldan007 testing one two three
