OpenPecha / extract-missing-glyphs

MIT License

OCR0030: Extracting missing Tibetan glyphs & ligatures #1

Open 10kalden opened 3 weeks ago

10kalden commented 3 weeks ago

Description:

Implementation plan:

Sub-task:

Completion Criteria:

10kalden commented 2 weeks ago

To extract Tibetan subjoined letters, I am writing a script that parses all the Tibetan ligatures we have found and checks each one for subjoined letters. Every ligature that contains a subjoined letter will be uploaded to S3, and a JSONL file with all the metadata will be created so the items can be loaded into Prodigy for annotation.
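
A minimal sketch of the check described above, not the actual script. It assumes that "subjoined letter" means a code point in the Unicode range U+0F90 to U+0FBC (Tibetan subjoined consonants) and that Prodigy reads image tasks from an `image` key. The function names, the `base_url` parameter, and the image naming scheme are hypothetical, and the S3 upload step is left out.

```python
import json

# Assumption: subjoined consonants occupy U+0F90..U+0FBC in the Tibetan block.
SUBJOINED_START = 0x0F90
SUBJOINED_END = 0x0FBC


def subjoined_letters(ligature):
    """Return the subjoined letters found in one ligature, in order."""
    return [ch for ch in ligature if SUBJOINED_START <= ord(ch) <= SUBJOINED_END]


def to_records(ligatures, base_url):
    """Build one annotation record per ligature that has a subjoined letter.

    base_url is a hypothetical prefix for where the glyph images were uploaded.
    """
    records = []
    for index, ligature in enumerate(ligatures):
        found = subjoined_letters(ligature)
        if not found:
            continue  # skip ligatures without any subjoined letter
        records.append({
            "image": f"{base_url}/{index:04d}.png",  # hypothetical naming scheme
            "text": ligature,
            "meta": {"subjoined": found},
        })
    return records


def dump_jsonl(records):
    """Serialize records as JSONL, keeping Tibetan characters unescaped."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

For example, `to_records(["\u0F66\u0F90", "\u0F40"], "https://example.invalid/glyphs")` keeps only the first item, because only it contains a subjoined letter (U+0F90).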

10kalden commented 2 weeks ago

To extract the complex ligatures that Google OCR missed, I am using the transcribed text of work ID W2KG209989 (Derge Tengyur) in OPF. The script will parse the text to find each ligature's image number and span, and then use those to extract the glyphs.
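
A rough sketch of the parsing step, under several assumptions. Here a "complex ligature" is taken to be a base letter followed by two or more subjoined letters, and the pagination is represented by a hypothetical `Page` record holding an image number and a character span into the base text. This is not the OPF API; reading the real layers and cropping the glyph from the image are not shown.

```python
import re
from dataclasses import dataclass

# Assumption: a complex ligature is a base letter (U+0F40..U+0F6C) followed by
# two or more subjoined letters (U+0F90..U+0FBC).
COMPLEX_STACK = re.compile(r"[\u0F40-\u0F6C][\u0F90-\u0FBC]{2,}")


@dataclass
class Page:
    """Hypothetical pagination record: image number and span in the base text."""
    imgnum: int
    start: int
    end: int


def find_complex_ligatures(base_text, pages):
    """Return each complex ligature with its image number and base-text span."""
    hits = []
    for page in pages:
        page_text = base_text[page.start:page.end]
        for match in COMPLEX_STACK.finditer(page_text):
            hits.append({
                "ligature": match.group(),
                "imgnum": page.imgnum,
                # Offset the match back into base-text coordinates.
                "span": (page.start + match.start(), page.start + match.end()),
            })
    return hits
```

Searching page by page keeps every hit tied to exactly one image number, at the cost of missing a stack that straddles a page boundary, which should not occur in practice.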