OCR0037: Extracting missing Tibetan glyphs & ligatures

10kalden commented 5 months ago

Description:

To create a new Tibetan font, around 1044 essential glyphs need to be added to the font. Glyph and ligature extraction from the works has already been done, and around 700 glyphs have been extracted for use in the fonts.
The glyphs still missing are mainly Tibetan superscripts, subscripts, and complex ligatures.
These were not obtained when the works were run through Google OCR.
To obtain the missing glyphs, another approach has to be taken.

Implementation plan:

Sub-task:

[x] Explore the OPF folders
[x] Select all the ligatures with the required superscripts and subscripts to be cropped
[x] Upload the subjoined images to S3 and create a JSONL file
[x] Explore alternate procedures to extract the ligatures that were not caught by Google OCR
[x] Write a script to extract the ligatures from the transcribed text

Completion Criteria:

To obtain all the missing glyphs and ligatures

10kalden commented 5 months ago

To extract Tibetan subjoined letters, I am writing a script that parses all the Tibetan ligatures we have found and checks each one for subjoined letters. Every ligature that contains a subjoined letter will be uploaded to S3, and a JSONL file with all the metadata will be created so it can be loaded into Prodigy for annotation.
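The subjoined-letter check described above could be sketched roughly as follows. This is only an illustration, not the actual script: the function names, the `ligature_to_url` mapping, and the record layout (`image`, `text`, `meta`) are assumptions, and the S3 upload itself is left out. The only fixed fact it relies on is that Unicode encodes Tibetan subjoined consonants in the range U+0F90 to U+0FBC.

```python
import json

# Unicode range of Tibetan subjoined consonants (U+0F90..U+0FBC).
SUBJOINED_START, SUBJOINED_END = 0x0F90, 0x0FBC


def subjoined_letters(ligature):
    """Return the subjoined letters found in a ligature, in order."""
    return [ch for ch in ligature if SUBJOINED_START <= ord(ch) <= SUBJOINED_END]


def build_jsonl(ligature_to_url):
    """Build JSONL text with one record per ligature that has a subjoined letter.

    `ligature_to_url` is a hypothetical mapping from ligature text to the
    URL of its already-uploaded image; the record keys are assumptions.
    """
    lines = []
    for ligature, url in ligature_to_url.items():
        found = subjoined_letters(ligature)
        if not found:
            continue  # skip ligatures without any subjoined letter
        record = {"image": url, "text": ligature, "meta": {"subjoined": found}}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

For example, `subjoined_letters("\u0f66\u0f90")` (sa with subjoined ka) returns the single subjoined letter, while a plain `"\u0f40"` (ka) returns an empty list and would be skipped.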

10kalden commented 5 months ago

To extract the complex ligatures that Google OCR missed, I am using the transcribed text of work ID W2KG209989 (Derge Tengyur) in OPF. The script will parse the text to find each ligature's image number and span, and use those to extract the glyphs.
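A minimal sketch of the text-parsing step, under stated assumptions: a "ligature" is taken to be a base consonant followed by one or more subjoined consonants and any trailing vowel signs or related marks, and the page layout is passed in as a hypothetical list of `(image_num, start, end)` character spans. The real OPF layer format is not reproduced here, and the names `find_stacks` and `image_for_offset` are made up for illustration.

```python
import re

# Base consonant, one or more subjoined consonants, then optional vowel
# signs and related marks (U+0F71..U+0F84).
STACK_RE = re.compile(r"[\u0F40-\u0F6C][\u0F90-\u0FBC]+[\u0F71-\u0F84]*")


def find_stacks(text, min_subjoined=1):
    """Return (start, end, stack) for each stack with enough subjoined letters."""
    results = []
    for match in STACK_RE.finditer(text):
        n_sub = sum(1 for ch in match.group() if "\u0F90" <= ch <= "\u0FBC")
        if n_sub >= min_subjoined:
            results.append((match.start(), match.end(), match.group()))
    return results


def image_for_offset(pages, offset):
    """Return the image number whose character span contains `offset`.

    `pages` is a hypothetical list of (image_num, start, end) tuples with
    `end` exclusive; returns None if no page covers the offset.
    """
    for image_num, start, end in pages:
        if start <= offset < end:
            return image_num
    return None
```

Raising `min_subjoined` to 2 restricts the search to the more complex stacks. The returned span and image number together identify where a glyph would need to be cropped; locating its pixel coordinates on the image is a separate step not covered by this sketch.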

OpenPecha / extract-missing-glyphs
