OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories

RFC00156: filtering of ocr line images data based on ocr confidence. #476

Open gangagyatso4364 opened 3 months ago

gangagyatso4364 commented 3 months ago

RFC00156: filtering of ocr line images data based on ocr confidence

Named Concepts

  1. OCR Data: The text and metadata extracted from page images using Optical Character Recognition, stored as HTML files.
  2. OCR Confidence: A numerical score indicating how confident the OCR engine is in each extracted text segment.
  3. Line Image: A subsection of a page image, corresponding to a line of text.
  4. Non-word: Tibetan text that is misspelled.
  5. Non-bo word: A non-Tibetan word ("bo" is the ISO 639-1 language code for Tibetan).

Summary

This RFC proposes creating a training dataset of line images and their corresponding transcripts, stored in CSV files. The input is a set of page images organized in work-wise folders, together with the corresponding page descriptions in HTML files. The output is divided into three categories by OCR confidence: 50-75%, 76-90%, and 91-100%. Each category also reports its non-word and non-bo word counts.
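The three confidence categories can be sketched as a small lookup function. This is only an illustration: the names `BUCKETS` and `confidence_bucket` are hypothetical, and it assumes the confidence is given in percent (0 to 100) and that fractional scores are rounded with Python's `round()` before comparison, since the RFC does not say how values between the stated ranges (for example 75.5) should be handled.

```python
from typing import Optional

# Category ranges taken from the RFC summary; the labels are hypothetical.
BUCKETS = [
    ("50-75", 50, 75),
    ("76-90", 76, 90),
    ("91-100", 91, 100),
]


def confidence_bucket(confidence: float) -> Optional[str]:
    """Return the category label for an OCR confidence given in percent."""
    # Assumption: fractional scores are rounded to a whole percent first.
    score = round(confidence)
    for label, low, high in BUCKETS:
        if low <= score <= high:
            return label
    # Scores outside all three ranges (for example below 50) get no category.
    return None
```

For example, `confidence_bucket(82)` returns `"76-90"`, while a line with a confidence of 40 returns `None` and would not fall into any of the three categories.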

Dependencies

  1. BeautifulSoup4 for HTML parsing
  2. Pillow for image processing
  3. botok for Tibetan text filtering

Infrastructures

  1. An S3 bucket or equivalent for storing processed line images and metadata CSV files

Design Illustrations

OCR input:

  1. Repo folder: a structured repo folder with the HTML files for each page.
  2. Work folder: a structured work folder with the page images.
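The RFC does not describe the layout of the HTML files. As a rough sketch, assuming they follow the hOCR convention in which an element's `title` attribute carries properties such as `bbox x0 y0 x1 y1` and `x_wconf N`, the bounding box and confidence could be read as shown here. The function names are hypothetical, and the actual files may use a different structure.

```python
import re
from typing import Optional, Tuple


def parse_bbox(title: str) -> Optional[Tuple[int, int, int, int]]:
    """Extract (left, upper, right, lower) from an hOCR-style title string."""
    match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if match is None:
        return None
    left, upper, right, lower = (int(value) for value in match.groups())
    return (left, upper, right, lower)


def parse_confidence(title: str) -> Optional[float]:
    """Extract the confidence (in percent) from an hOCR-style title string."""
    # Assumption: the confidence is stored in an x_wconf property.
    match = re.search(r"x_wconf (\d+(?:\.\d+)?)", title)
    return float(match.group(1)) if match else None
```

The tuple returned by `parse_bbox` has the same (left, upper, right, lower) order that Pillow's `Image.crop` expects for its box argument, so it could be passed along directly when cutting a line image out of a page image.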

Output:

  1. Work folder: a structured work folder with the line images stored in one folder per page ID.
  2. Work CSV files: CSV files holding the transcripts of the line images, categorized into three OCR confidence ranges.
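One possible way to produce a CSV per confidence category is sketched below with the standard `csv` module. The column names in `FIELDS`, the `category` key, and the function name are all hypothetical, since the RFC does not fix a CSV schema. The sketch builds the CSV text in memory, so writing it to disk or to an S3 bucket is left out.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column layout; the RFC does not define the CSV schema.
FIELDS = [
    "line_image",
    "transcript",
    "ocr_confidence",
    "non_word_count",
    "non_bo_word_count",
]


def rows_to_csv_by_category(rows):
    """Group row dicts by their 'category' key and render one CSV per group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["category"]].append(row)

    csv_texts = {}
    for category, members in grouped.items():
        buffer = io.StringIO()
        # extrasaction="ignore" drops the helper 'category' key from the output.
        writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(members)
        csv_texts[category] = buffer.getvalue()
    return csv_texts
```

Each value in the returned dictionary is the full text of one CSV file, keyed by its category label, for example `"91-100"`.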

Justification

The proposed design was selected for its simplicity, effectiveness in improving data quality, and ease of integration with existing OCR processing workflows. Alternatives, such as more complex machine learning-based post-processing, were considered but deemed unnecessary for the current scope and requirements.

Testing

  1. Unit Tests: To cover individual functions, particularly parsing and image processing.
  2. Integration Tests: To ensure the system works as a whole, especially the flow from HTML parsing to line image cropping and CSV updating.

Implementation Steps

List all the steps involved during implementation.

Reviewed By

@ta4tsering

ta4tsering commented 3 months ago

Please also include the percentage of inference text validity for each of those sections.

gangagyatso4364 commented 3 months ago

okay

ta4tsering commented 3 months ago

@kaldan007 testing one two three