Named Concepts
Pillow: A fork of the Python Imaging Library (PIL) that adds image processing capabilities to your Python interpreter. It provides extensive file format support, an efficient internal representation, and fairly powerful image processing capabilities.
Summary
This RFC details the approach for filtering OCR training data obtained from Norbuketaka. The goal is to refine the dataset by eliminating images whose length (width) is shorter than their height, and texts that are too short (one or two characters) or that contain numbers or other non-Tibetan text. Texts that contain Tibetan tokens that are not words, or whose corresponding image ID is not present in the JSON file, are also rejected.
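The text-content criteria can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the function name `is_acceptable_text`, the use of the Unicode Tibetan block (U+0F00 to U+0FFF) as the definition of "Tibetan text", the counting of characters as code points after removing whitespace, and the treatment of Tibetan numerals as "numbers" are all assumptions.

```python
import re

# Assumption: "Tibetan text" means code points in the Unicode Tibetan block
# (U+0F00-U+0FFF); whitespace between syllables is tolerated.
NON_TIBETAN = re.compile(r"[^\u0F00-\u0FFF\s]")
# Assumption: Tibetan digits and half digits (U+0F20-U+0F33) count as numbers.
TIBETAN_DIGITS = re.compile(r"[\u0F20-\u0F33]")
MIN_CHARS = 3  # texts of one or two characters are rejected


def is_acceptable_text(text: str) -> bool:
    """Return True if the transcription passes the text-content checks."""
    stripped = "".join(text.split())  # remove all whitespace before counting
    if len(stripped) < MIN_CHARS:
        return False  # too short (empty, one, or two characters)
    if NON_TIBETAN.search(text):
        return False  # ASCII digits, Latin letters, CJK, and so on
    if TIBETAN_DIGITS.search(text):
        return False  # Tibetan numerals
    return True
```

Counting code points is the simplest reading of "one or two characters"; counting syllables (splitting on the tsheg, U+0F0B) would be a stricter alternative if that is what the requirement intends.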
Dependencies
1. Pillow: For image processing and dimension analysis.
2. Botok: Tokenizes Tibetan text into words with optional attributes such as lemma, POS, and clean form.
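As a rough sketch of how the two dependencies might be used: Pillow reports an image's size as a `(width, height)` tuple, and Botok's `WordTokenizer().tokenize(text)` returns token objects that carry a POS attribute. The helper names below are hypothetical, and the `"NON_WORD"` tag value is an assumption that should be checked against the installed Botok version. The non-word check accepts any objects with a `pos` attribute, so it does not depend on Botok being installed.

```python
from typing import Iterable, Tuple


def image_size(path: str) -> Tuple[int, int]:
    """Return (width, height) of the image at `path` using Pillow."""
    # Imported lazily so the pure checks below can be used without Pillow.
    from PIL import Image

    with Image.open(path) as img:
        return img.size  # Pillow reports size as (width, height)


def is_wide_enough(width: int, height: int) -> bool:
    """Keep images unless their width is shorter than their height."""
    return width >= height


def has_non_word(tokens: Iterable, non_word_tag: str = "NON_WORD") -> bool:
    """Return True if any token carries the non-word POS tag.

    `tokens` is meant to be the output of Botok's
    `WordTokenizer().tokenize(text)`; the tag value is an assumption.
    """
    return any(getattr(tok, "pos", None) == non_word_tag for tok in tokens)
```

A line image would then be kept when `is_wide_enough(*image_size(path))` is true and `has_non_word(tokens)` is false for its transcription.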
Infrastructures
Design Illustrations
Justification
The chosen design focuses on accuracy and efficiency:
Pillow is selected for its robust image processing capabilities, ensuring precise dimension analysis.
Filtering criteria (dimension and text content checks) directly address the project's quality requirements.
Alternative approaches, like more lenient filtering criteria, could result in lower quality training data, adversely affecting OCR model performance.
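The remaining criterion, rejecting text whose image ID is missing from the JSON file, could look like the sketch below. The layout of the JSON file is not specified in this RFC, so the code assumes either a JSON object keyed by image ID or a JSON array of image IDs. The function names and the `(image_id, text)` pair format are hypothetical.

```python
import json
from typing import Iterable, Iterator, Set, Tuple


def load_image_ids(json_path: str) -> Set[str]:
    """Read the known image IDs from a JSON file.

    Assumes a JSON object keyed by image ID, or a JSON array of IDs;
    iterating a dict yields its keys, so both layouts work here.
    """
    with open(json_path, encoding="utf-8") as f:
        return set(json.load(f))


def filter_by_known_ids(
    pairs: Iterable[Tuple[str, str]], known_ids: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yield only the (image_id, text) pairs whose image ID is known."""
    for image_id, text in pairs:
        if image_id in known_ids:
            yield image_id, text
```

Loading the IDs into a set once keeps each membership test constant-time, which matters when the dataset holds many line images.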
Testing
1. Unit Testing: To ensure each function (image dimension check, text length check, and character checks) works as expected.
2. Integration Testing: To verify that the complete workflow, from data input to filtered output, functions correctly.
3. Validation: A subset of the data will be manually reviewed to ensure the filtering process meets quality standards.
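A unit test for two of the checks might be structured as follows, using the standard `unittest` module. The functions under test are hypothetical stand-ins defined inline so the example runs on its own; the real tests would import the project's functions instead.

```python
import unittest


def is_wide_enough(width: int, height: int) -> bool:
    """Stand-in dimension check: reject images narrower than they are tall."""
    return width >= height


def is_long_enough(text: str, min_chars: int = 3) -> bool:
    """Stand-in length check: reject texts of one or two characters."""
    return len("".join(text.split())) >= min_chars


class FilterChecksTest(unittest.TestCase):
    def test_dimension_check(self):
        self.assertTrue(is_wide_enough(800, 60))
        self.assertFalse(is_wide_enough(60, 800))

    def test_text_length_check(self):
        self.assertTrue(is_long_enough("\u0f56\u0f40\u0fb2\u0f0b"))
        self.assertFalse(is_long_enough("\u0f40\u0f0b"))


def run_tests() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FilterChecksTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` executes both cases; in a real repository the same test class could simply be discovered by `python -m unittest`.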
Implementation Steps
List all the steps involved during implementation.
[ ] OpenPecha/image-to-text#1
Estimated time: 0.5 hour
Actual time:
[ ] OpenPecha/image-to-text#1
Estimated time: 0.5 hour
Actual time:
[ ] OpenPecha/image-to-text#2
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/image-to-text#3
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/image-to-text#4
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/image-to-text#5
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/image-to-text#6
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/image-to-text#7
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/image-to-text#8
Estimated time: 1 hour
Actual time:
RFC0145: Filter Norbuketaka OCR training data.
Reviewed By
@ta4tsering