RFC150: Create OCR training data by doing annotation transfer
Named Concepts
OCR stands for "Optical Character Recognition." It is a technology that recognizes text within a digital image.
Summary
We are going to create a script that transfers the line annotations from an OCRed manuscript's text onto the cleaned text of that scripture. The cleaned text, now carrying the transferred line annotations, is then mapped back to its page images. Finally, we crop the line images out of each page using the coordinates in the XML files and pair each line image with its corresponding line text.
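As a rough sketch of the transfer step, assuming antx exposes the transfer(source, patterns, target, output=...) helper described in its README and that the OCRed text's line breaks carry the line segmentation (the texts below are hypothetical placeholders):

```python
from antx import transfer

# OCRed page text: line breaks mark where each line of the page image ends.
ocr_page_text = "first line of the page as OCRed\nsecond line as OCRed"

# Cleaned text of the same passage, without the OCR line breaks.
clean_page_text = "first line of the page as cleaned second line as cleaned"

# Transfer the line-break annotations from the OCRed text onto the cleaned text,
# so the cleaned text is split into the same lines that appear on the page image.
patterns = [["line_break", r"(\n)"]]
clean_with_lines = transfer(ocr_page_text, patterns, clean_page_text, output="txt")

# One entry per line of the page image, now taken from the cleaned text.
clean_lines = clean_with_lines.splitlines()
```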
If there are no XML files with proper line-image coordinates, we will use our line detection model to predict the coordinates instead.
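The cropping step could look like the sketch below, assuming each XML file gives one bounding box per line as x/y/width/height attributes (the real schema may differ) and that the line order matches the order of the transferred line texts:

```python
from pathlib import Path
import xml.etree.ElementTree as ET

from PIL import Image


def crop_lines(page_image_path, page_xml_path, line_texts, out_dir):
    """Crop each line from the page image and save it next to its line text."""
    image = Image.open(page_image_path)
    root = ET.parse(page_xml_path).getroot()

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical schema: one <line x=".." y=".." width=".." height=".."/> per text line.
    for i, (line_el, text) in enumerate(zip(root.iter("line"), line_texts), start=1):
        x, y = int(line_el.get("x")), int(line_el.get("y"))
        w, h = int(line_el.get("width")), int(line_el.get("height"))

        line_img = image.crop((x, y, x + w, y + h))
        line_img.save(out_dir / f"line_{i:04}.png")
        (out_dir / f"line_{i:04}.txt").write_text(text, encoding="utf-8")
```

When no XML coordinates exist, the same pairing logic can be fed bounding boxes predicted by the line detection model instead of the parsed XML elements.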
Dependencies
antx
Hugging Face
S3 bucket
Infrastructures
Google Vision API
Design Illustrations
Implementation without line-image coordinates
Implementation with line-image coordinates
Justification
The alternative would be handing the line images to annotators for manual transcription, which is more expensive in both money and time. This way we can create a large amount of training data for the various manuscripts whose texts have already been cleaned.
Testing
We will test on the first page of volume 3 of the Nyigma Gybum.
Implementation Steps
[x] OpenPecha/ocr-ann-transfer#1
Estimated time: 1 hr
Actual time:
[x] OpenPecha/ocr-ann-transfer#2
Estimated time: 4 hr
Actual time: 4 hr
[ ] OpenPecha/ocr-ann-transfer#3
Estimated time: 5 hr
Actual time:
[ ] OpenPecha/ocr-ann-transfer#4
Estimated time: 6 hr
Actual time:
[ ] OpenPecha/ocr-ann-transfer#5
Estimated time: 6 hr
Actual time:
[ ] OpenPecha/ocr-ann-transfer#6
Estimated time: 6 hr
Actual time:
[ ] OpenPecha/ocr-ann-transfer#7
Estimated time: 3 hr
Actual time:
Reviewed By
@kaldan007