Open 10kalden opened 5 months ago
To extract Tibetan subjoined letters, I am writing a script that parses all the Tibetan ligatures we have found and checks each one for subjoined letters. Ligatures that contain a subjoined letter will be uploaded to S3, and a JSONL file with their metadata will be created for loading into Prodigy for annotation.
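A minimal sketch of the subjoined-letter check and the JSONL records, under a few assumptions: subjoined letters are taken to be the Unicode Tibetan subjoined consonants (U+0F90 to U+0FBC), each ligature already has an image key, and the records use the `image` and `meta` fields that Prodigy image tasks accept. The function names and the `url_template` value are hypothetical, and the S3 upload itself is left out.

```python
import json

# Unicode block range for Tibetan subjoined consonants (assumed definition
# of "subjoined letter" for this sketch).
SUBJOINED_START = 0x0F90
SUBJOINED_END = 0x0FBC


def subjoined_letters(ligature):
    """Return the subjoined consonants found in a ligature string."""
    return [ch for ch in ligature if SUBJOINED_START <= ord(ch) <= SUBJOINED_END]


def build_records(ligatures, url_template="https://example-bucket.s3.amazonaws.com/{key}"):
    """Build one record per ligature that contains a subjoined letter.

    `ligatures` maps a hypothetical image key to the ligature text.
    `url_template` is a placeholder, not a real bucket.
    """
    records = []
    for key, text in ligatures.items():
        found = subjoined_letters(text)
        if not found:
            continue  # skip ligatures with no subjoined letter
        records.append({
            "image": url_template.format(key=key),
            "text": text,
            "meta": {"subjoined": found, "image_key": key},
        })
    return records


def write_jsonl(records, path):
    """Write records as one JSON object per line, keeping Tibetan text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, `build_records({"a.png": "\u0f66\u0f92\u0fb2", "b.png": "\u0f40"})` keeps only `a.png`, because the second entry is a bare base letter.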
To extract the complex ligatures that Google OCR missed, I am using the transcribed text of work ID W2KG209989 (Derge Tengyur) in OPF. The script will parse the text to find each ligature's image number and span, and use those to extract the glyphs.
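One way the span lookup could work, as a rough sketch: it assumes the OPF pagination has already been read into a sorted list of `(start, end, image_number)` tuples over the base text (the actual OPF layer layout is not shown here), and it treats a ligature as a base letter followed by one or more subjoined consonants. `find_stacks` and `min_subjoined` are hypothetical names.

```python
import re
from bisect import bisect_right

# A stack: one Tibetan base letter followed by one or more subjoined consonants.
STACK_RE = re.compile(r"[\u0f40-\u0f6c][\u0f90-\u0fbc]+")


def find_stacks(text, page_spans, min_subjoined=1):
    """Locate stacks in `text` and attach the image number of their page.

    `page_spans` is an assumed structure: a list of (start, end, image_number)
    tuples sorted by start, with `end` exclusive.
    """
    starts = [start for start, _, _ in page_spans]
    results = []
    for match in STACK_RE.finditer(text):
        # The first character is the base letter; the rest are subjoined.
        if len(match.group()) - 1 < min_subjoined:
            continue
        # Find the last page that starts at or before the match.
        i = bisect_right(starts, match.start()) - 1
        if i < 0 or match.start() >= page_spans[i][1]:
            continue  # match falls outside every known page span
        results.append({
            "ligature": match.group(),
            "start": match.start(),
            "end": match.end(),
            "image_number": page_spans[i][2],
        })
    return results
```

Setting `min_subjoined=2` restricts the results to stacks with at least two subjoined consonants, which may be a reasonable first filter for "complex" ligatures.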
Description:
Implementation plan:
Sub-task:
Completion Criteria: