OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories

[RFC0145]:Filter Norbuketaka OCR training data. #421

Open gangagyatso4364 opened 8 months ago

gangagyatso4364 commented 8 months ago

RFC0145: Filter Norbuketaka OCR training data.

Named Concepts

Pillow: The Python Imaging Library adds image processing capabilities to your Python interpreter. This library provides extensive file format support, an efficient internal representation, and fairly powerful image processing capabilities.

Summary

This RFC details the approach for filtering OCR training data obtained from Norbuketaka. The goal is to refine the dataset by eliminating images whose length (width) is shorter than their height, and texts that are too short (one or two characters) or that contain numbers or other non-Tibetan text. The filter also rejects texts that contain non-word Tibetan text, as well as texts whose corresponding image ID is not present in the JSON file.
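The text criteria above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `keep_text`, the constants, the choice to treat the Tibetan Unicode block (U+0F00 to U+0FFF) as "Tibetan text", and the choice to count Tibetan digits as "numbers" are all assumptions made for the example. The non-word check (Botok) and the image ID check are left out.

```python
import re

# Assumption: "Tibetan text" means code points in the Tibetan Unicode block
# (U+0F00-U+0FFF) plus ordinary whitespace. Anything else (Latin letters,
# ASCII digits, other scripts) is treated as non-Tibetan.
NON_TIBETAN = re.compile(r"[^\u0F00-\u0FFF\s]")

# Assumption: Tibetan digits and half digits (U+0F20-U+0F33) count as "numbers".
TIBETAN_NUMBER = re.compile(r"[\u0F20-\u0F33]")

# Texts of one or two characters are rejected, so three is the minimum kept.
MIN_CHARS = 3


def keep_text(text: str) -> bool:
    """Return True if a transcription passes the length and character checks."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return False  # too short: one or two characters (or empty)
    if NON_TIBETAN.search(stripped):
        return False  # contains ASCII digits or other non-Tibetan characters
    if TIBETAN_NUMBER.search(stripped):
        return False  # contains Tibetan numerals
    return True
```

Length here is counted in Unicode code points after stripping whitespace; counting syllables instead would need a different rule.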

Dependencies

  1. Pillow: For image processing and dimension analysis.
  2. Botok: Tokenizes Tibetan text into words with optional attributes such as lemma, POS, and clean form.
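As an illustration of the dimension analysis that Pillow is listed for, the sketch below reads an image's size and applies the "length not shorter than height" rule. The function names are hypothetical, and treating "length" as the image width is an assumption; only `Image.open` and the `size` attribute are Pillow's own API.

```python
from pathlib import Path


def is_wider_than_tall(width: int, height: int) -> bool:
    """Return True when the width is at least the height.

    Assumption: images with width < height are the ones to be eliminated.
    """
    return width >= height


def image_passes_dimension_check(path: Path) -> bool:
    """Open an image with Pillow and apply the dimension rule."""
    # Imported here so the pure helper above can be used and tested
    # without Pillow installed.
    from PIL import Image

    with Image.open(path) as img:
        width, height = img.size  # Pillow reports size as (width, height)
    return is_wider_than_tall(width, height)
```

Keeping the comparison in a separate pure function makes the rule easy to unit test without image files.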

Infrastructures

Design Illustrations


Justification

The chosen design focuses on accuracy and efficiency:

  1. Pillow is selected for its robust image processing capabilities, ensuring precise dimension analysis.
  2. Filtering criteria (dimension and text content checks) directly address the project's quality requirements.
  3. Alternative approaches, like more lenient filtering criteria, could result in lower quality training data, adversely affecting OCR model performance.

Testing

  1. Unit Testing: To ensure each function (image dimension check, text length, and character checks) works as expected.
  2. Integration Testing: To verify that the complete workflow, from data input to filtered output, functions correctly.
  3. Validation: A subset of the data will be manually reviewed to ensure the filtering process meets quality standards.
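A unit test for one of the checks named above might look like the following sketch. The helper `is_too_short` and its `min_chars` default are hypothetical stand-ins for the real text length check, included only so the example is self-contained.

```python
import unittest


def is_too_short(text: str, min_chars: int = 3) -> bool:
    """Hypothetical text length check: True for one or two characters."""
    return len(text.strip()) < min_chars


class TextLengthCheckTest(unittest.TestCase):
    def test_one_or_two_characters_are_too_short(self):
        self.assertTrue(is_too_short("\u0F40"))
        self.assertTrue(is_too_short("\u0F40\u0F0B"))

    def test_longer_text_is_kept(self):
        self.assertFalse(is_too_short("\u0F56\u0F40\u0FB2\u0F0B\u0F64\u0F72\u0F66"))

    def test_surrounding_whitespace_is_ignored(self):
        self.assertTrue(is_too_short("  \u0F40 \n"))
```

Saved in a test module, this can be run with `python -m unittest`; the dimension and character checks could be covered with the same pattern.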

Implementation Steps

List all the steps involved during implementation.

Reviewed By

@ta4tsering