OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories

[RFW0036] Automate extraction of Glyph for synthetic training data #119

Open ta4tsering opened 1 year ago

ta4tsering commented 1 year ago

Housekeeping

Make sure to clearly understand Type-A and Type-B requests, and the relevant limitations. Failing to follow the guidelines pertaining to the two acceptable types of RFWs will automatically lead to disqualification of the RFW.

Take time to complete each section below with as much detail as is required to establish a comprehensive understanding about the underlying product specification.

ALL BELOW FIELDS ARE REQUIRED

Owner

ta4tsering

Summary

Use the bounding poly values in the OCR output to crop symbols from the source images, producing symbol images for the synthetic training data.

Is This Really Necessary?

This is the best way to do it, since we can check the Unicode value of each symbol in the OCR output JSON file and see whether it falls within the Unicode range of Tibetan symbols. We can then easily collect the symbol images by cropping them from the source images.
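The range check described above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the "range of Tibetan symbols" means the Unicode Tibetan block (U+0F00 to U+0FFF), and `is_tibetan_symbol` is a hypothetical helper name. Because an OCR symbol may be a stack of several code points, every character in the text is checked.

```python
# Unicode "Tibetan" block: U+0F00..U+0FFF (assumed to be the desired range).
TIBETAN_BLOCK = range(0x0F00, 0x1000)


def is_tibetan_symbol(text: str) -> bool:
    """Return True if text is non-empty and every code point is in the Tibetan block."""
    return bool(text) and all(ord(ch) in TIBETAN_BLOCK for ch in text)


print(is_tibetan_symbol("\u0f40"))        # KA, a single Tibetan letter -> True
print(is_tibetan_symbol("\u0f66\u0f90"))  # SA with subjoined KA -> True
print(is_tibetan_symbol("A"))             # Latin letter -> False
```

A narrower range (for example, letters only, excluding punctuation and digits) could be substituted if those symbols are not wanted in the training data.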

Motivation

We could collect symbol images of Tibetan calligraphy by cropping them from the OCR source images, and then filter the cropped images manually once we have a large enough collection to filter from.

Named Concepts

image-processing: cropping images using the bounding poly
s3: downloading source images from the S3 bucket
json: parsing the OCR output JSON to get the symbol info

Examples

This will automate the creation of synthetic symbol training data for the MonlamAI OCR projects. Rather than creating the images manually, humans will only need to filter the images after they have been cropped and collected automatically.

Conceptual Design

Parse the OCR output to get all the symbol info, go through the list of symbols, and find those whose Unicode value is within the desired range. For each such symbol, take its bounding poly, download the source image from S3, and use the bounding poly values to crop the symbol from the source image. Repeat this to collect as many symbol images as possible.
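The steps above could look roughly like the sketch below. Several things here are assumptions rather than part of this request: the OCR JSON is taken to follow the Google Cloud Vision `fullTextAnnotation` layout (pages, blocks, paragraphs, words, symbols, each symbol carrying `text` and a polygon under `boundingBox` or `boundingPoly`), the desired range is taken to be the Unicode Tibetan block, and all function names are hypothetical. Downloading uses boto3 and cropping uses Pillow, both imported lazily so the parsing part runs with the standard library alone.

```python
TIBETAN_FIRST, TIBETAN_LAST = 0x0F00, 0x0FFF  # assumed desired Unicode range


def iter_symbols(ocr):
    """Yield every symbol dict, assuming a Google Vision style JSON layout."""
    for page in ocr.get("fullTextAnnotation", {}).get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    yield from word.get("symbols", [])


def crop_box(symbol):
    """Convert polygon vertices to a (left, upper, right, lower) box, or None."""
    poly = symbol.get("boundingBox") or symbol.get("boundingPoly") or {}
    vertices = poly.get("vertices", [])
    if not vertices:
        return None
    # A coordinate of 0 may be omitted from the JSON, hence the default.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def tibetan_crop_boxes(ocr):
    """Return (text, box) pairs for symbols whose code points are all Tibetan."""
    results = []
    for symbol in iter_symbols(ocr):
        text = symbol.get("text", "")
        in_range = bool(text) and all(
            TIBETAN_FIRST <= ord(ch) <= TIBETAN_LAST for ch in text
        )
        box = crop_box(symbol)
        # Skip symbols outside the range and empty (zero-area) boxes.
        if in_range and box and box[2] > box[0] and box[3] > box[1]:
            results.append((text, box))
    return results


def download_source_image(bucket, key, dest_path):
    """Fetch one source image from S3 (requires boto3 and AWS credentials)."""
    import boto3
    boto3.client("s3").download_file(bucket, key, dest_path)


def crop_symbols(image_path, boxes):
    """Crop each box out of the source image (requires Pillow)."""
    from PIL import Image
    with Image.open(image_path) as image:
        return [(text, image.crop(box)) for text, box in boxes]
```

The polygon is reduced to its axis-aligned bounding rectangle, which is the simplest choice; for skewed scans a rotated crop might give tighter glyph images. Keeping the symbol text alongside each crop makes the later manual filtering easier, since the images can be grouped by predicted character.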

Drawbacks

One drawback is that if Google OCR has mispredicted symbols in the source images, we will end up with a lot of false-positive symbol images.

Alternatives

The alternative is to have humans manually create synthetic symbol data from images of books or pechas, which would be a very laborious job.

New Data

If applicable, explain clearly the new data artifacts that will result from implementing this proposed work.

Adaption Window

A rough timing for the planned release for the specification possibly resulting from this request.

kaldan007 commented 1 year ago

lgtm