OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories

RFW0145: Filter Norbuketaka OCR training data #401

Open ta4tsering opened 5 months ago

ta4tsering commented 5 months ago

RFW0145: Filter Norbuketaka OCR training data.

Summary

We have OCR training data from Norbuketaka, made up of line-image and text pairs, that needs to be filtered.

Key Concepts

Pillow: The Python Imaging Library adds image processing capabilities to your Python interpreter. This library provides extensive file format support, an efficient internal representation, and fairly powerful image processing capabilities.

Context

The Norbuketaka project team has already proofread about 500 works, and we have the Google OCR output and images for all of those works. We got line images and their corresponding text from Sina, who produced them from the proofread text, the work images, and the Google OCR output mentioned above. We still need to filter out every pair where the image's length (width) is shorter than its height, where the text has only one or two characters, or where the text contains numbers.
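The three filter rules above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the agreed implementation: the JSON file is assumed to map image file names to line text, "length" is read as image width, "characters" is read as Unicode code points, and the names `text_is_usable`, `shape_is_usable`, `image_size`, and `filter_pairs` are hypothetical.

```python
from pathlib import Path

# Assumption: "one or two characters" means fewer than 3 code points.
MIN_CHARS = 3


def text_is_usable(text: str) -> bool:
    """Reject text that is too short or that contains any digit."""
    stripped = text.strip()
    if len(stripped) < MIN_CHARS:
        return False
    # str.isdigit() is True for ASCII digits and for Tibetan digits
    # (U+0F20 to U+0F29), so both kinds of numbers are caught.
    return not any(ch.isdigit() for ch in stripped)


def shape_is_usable(width: int, height: int) -> bool:
    """Reject images that are narrower than they are tall."""
    return width >= height


def image_size(path):
    """Return (width, height) of an image using Pillow."""
    # Imported here so the text checks can run without Pillow installed.
    from PIL import Image

    with Image.open(path) as img:
        return img.size


def filter_pairs(pairs, image_dir, get_size=image_size):
    """Keep only the pairs that pass all three rules.

    `pairs` is assumed to be a dict of {image file name: line text}.
    """
    kept = {}
    for image_name, text in pairs.items():
        if not text_is_usable(text):
            continue
        width, height = get_size(Path(image_dir) / image_name)
        if shape_is_usable(width, height):
            kept[image_name] = text
    return kept
```

Checking the text first avoids opening images for pairs that would be rejected anyway. If the real JSON file has a different layout, only the loop in `filter_pairs` would need to change.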

Outputs

Filtered, usable OCR training data

Inputs

Line images and a JSON file on Google Drive

Timeline

Specify the expected delivery date for the project.

References

Include any relevant links or resources for additional context or information.