RFC150: Create OCR training data by doing annotation transfer
Named Concepts
OCR stands for "Optical Character Recognition." It is a technology that recognizes text within a digital image.
Summary
We are going to create a script that transfers the line annotations from an OCRed manuscript's text onto the cleaned text of that scripture. The cleaned text, now carrying the transferred line annotations, is then mapped back to its page images. Finally, we crop the line images out of each page using the coordinates in the XML files and pair each line image with its corresponding line text.
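As a rough sketch of the transfer step, assuming antx exposes the transfer(source, patterns, target, output=...) helper described in its README and that the OCRed text's line breaks carry the line segmentation (the texts below are hypothetical placeholders):

```python
from antx import transfer

# OCRed page text: line breaks mark where each line of the page image ends.
ocr_page_text = "first line of the page as OCRed\nsecond line as OCRed"

# Cleaned text of the same passage, without the OCR line breaks.
clean_page_text = "first line of the page as cleaned second line as cleaned"

# Transfer the line-break annotations from the OCRed text onto the cleaned text,
# so the cleaned text is split into the same lines that appear on the page image.
patterns = [["line_break", r"(\n)"]]
clean_with_lines = transfer(ocr_page_text, patterns, clean_page_text, output="txt")

# One entry per line of the page image, now taken from the cleaned text.
clean_lines = clean_with_lines.splitlines()
```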
If there are no XML files with proper line-image coordinates, we will use our line detection model to predict the coordinates instead.
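The cropping step could look like the sketch below, assuming each XML file gives one bounding box per line as x/y/width/height attributes (the real schema may differ) and that the line order matches the order of the transferred line texts:

```python
from pathlib import Path
import xml.etree.ElementTree as ET

from PIL import Image


def crop_lines(page_image_path, page_xml_path, line_texts, out_dir):
    """Crop each line from the page image and save it next to its line text."""
    image = Image.open(page_image_path)
    root = ET.parse(page_xml_path).getroot()

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical schema: one <line x=".." y=".." width=".." height=".."/> per text line.
    for i, (line_el, text) in enumerate(zip(root.iter("line"), line_texts), start=1):
        x, y = int(line_el.get("x")), int(line_el.get("y"))
        w, h = int(line_el.get("width")), int(line_el.get("height"))

        line_img = image.crop((x, y, x + w, y + h))
        line_img.save(out_dir / f"line_{i:04}.png")
        (out_dir / f"line_{i:04}.txt").write_text(text, encoding="utf-8")
```

When no XML coordinates exist, the same pairing logic can be fed bounding boxes predicted by the line detection model instead of the parsed XML elements.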
Dependencies
antx
Hugging Face
S3 bucket
Infrastructures
Google Vision API
Design Illustrations
Implementation without line-image coordinates
Implementation with line-image coordinates
Justification
The alternative would be handing the line images to annotators for manual transcription, which is more expensive in both money and time. This way we can create a large amount of training data for the various manuscripts whose texts have already been cleaned.
Testing
We will test on the first page of volume 3 of the Nyigma Gybum.
Implementation Steps
[x] OpenPecha/ocr-ann-transfer#1
Estimated time: 1 hr
Actual time:
[x] OpenPecha/ocr-ann-transfer#2
Estimated time: 4 hr
Actual time: 4 hr
[ ] OpenPecha/ocr-ann-transfer#3
Estimated time: 5 hr
Actual time:
[ ] OpenPecha/ocr-ann-transfer#4
Estimated time: 6 hr
Actual time:
[ ] OpenPecha/ocr-ann-transfer#5
Estimated time: 6 hr
Actual time:
[ ] OpenPecha/ocr-ann-transfer#6
Estimated time: 6 hr
Actual time:
[ ] OpenPecha/ocr-ann-transfer#7
Estimated time: 3 hr
Actual time:
Reviewed By
@kaldan007