OCR0031: Extracting glyphs by indexing Tibetan characters using OPF data

10kalden commented 4 months ago

Description:

The goal is to extract Tibetan ligatures that were missed by Google OCR. To obtain all the glyphs accurately, we need the text transcription of the work, which is provided in OPF with an annotation layer.
The transcription will be parsed against a required character list to get each character's occurrences and its index or span in the pagination annotation layer.
From the pagination layer, we obtain the images in which the character occurred.
We then apply Tesseract OCR to each image to obtain character indices and match them against the character index in the pagination layer to crop the glyph out of the image.
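
The indexing step described above could look roughly like the following. This is a minimal sketch, not the project's actual script: it assumes the pagination layer has already been loaded into a list of `(start, end, image_name)` tuples, which is a hypothetical stand-in for the real OPF annotation format, and all function names are made up for illustration.

```python
from bisect import bisect_right


def find_occurrences(text, targets):
    """Return {target: [start offsets]} using plain substring search."""
    hits = {}
    for target in targets:
        starts = []
        pos = text.find(target)
        while pos != -1:
            starts.append(pos)
            pos = text.find(target, pos + 1)
        hits[target] = starts
    return hits


def page_for_offset(pages, offset):
    """Find the image whose span contains `offset`.

    ASSUMPTION: `pages` is a list of (start, end, image_name) tuples,
    sorted by start, with `end` exclusive. The real pagination layer
    may be shaped differently.
    """
    starts = [page[0] for page in pages]
    i = bisect_right(starts, offset) - 1
    if i >= 0 and pages[i][0] <= offset < pages[i][1]:
        return pages[i][2]
    return None


def images_for_targets(text, pages, targets):
    """Return {target: [image names]} in order of first occurrence."""
    result = {}
    for target, starts in find_occurrences(text, targets).items():
        names = []
        for start in starts:
            name = page_for_offset(pages, start)
            if name is not None and name not in names:
                names.append(name)
        result[target] = names
    return result
```

Because the search is a plain substring match on code points, a target such as ཡུ is also reported where additional marks follow it in the text, so the hits may need filtering before cropping.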

Subtask:

[x] write a script to parse the txt file and source the images where the character occurred
[x] write a script to apply OCR to the images to get the char indices
[x] write a script to crop the images, taking all the edge cases into consideration
[x] write a script to crop the line

Completion Criteria: To obtain cropped glyphs of the required Tibetan characters.

10kalden commented 4 months ago

Facing some issues with the logic behind the indices; need to explore it and figure it out.

10kalden commented 4 months ago

Doing some tests on finding the indices of a character, as they are not matching the desired images.

10kalden commented 4 months ago

Gathered all the required image names in a JSON file and modified the script to download only these images from BDRC.

10kalden commented 4 months ago

Tesseract output is not catching all the characters in the images, so it isn't easy to crop images based on indices. The solution is to determine on which line each character occurs and crop that line accordingly.

This is the output of Tesseract OCR. Here I am only getting around 400 characters per image, but the Google OCR output for the same images has about 1000 characters. Because of this discrepancy, cropping glyphs using indices is not going to work.
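
One way to find the line a character falls on, without relying on Tesseract's character indices, is to count line breaks in the page's transcription. This is only a sketch under a loud assumption: that the transcription for a page keeps one text line per image line, separated by `\n`. The function name is hypothetical.

```python
def line_of_offset(page_text, offset):
    """Return (line_number, column), both 0-based, for a character offset.

    ASSUMPTION: lines in `page_text` are separated by "\n" and correspond
    one-to-one with the text lines visible in the page image.
    """
    if not 0 <= offset < len(page_text):
        raise ValueError("offset is outside the page text")
    # Number of line breaks before the offset gives the line number.
    line_no = page_text.count("\n", 0, offset)
    # rfind returns -1 when there is no earlier break, so +1 yields 0.
    line_start = page_text.rfind("\n", 0, offset) + 1
    return line_no, offset - line_start
```

The offset here is relative to the start of the page, so a work-level index has to be reduced by the page's start offset first.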

10kalden commented 4 months ago

This is a JSON output representing each character and its position in the images. In the reference mapping, each value is a list giving the index and the line in which the character occurs. This will be used to crop the entire line out of the images to extract the required glyphs.
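
A mapping of this kind, and the crop box for a whole line, might be built along these lines. The JSON shape (`character -> image name -> list of [index, line]`) is a guess at the structure described above, not the actual output, and the per-line bounding boxes are assumed to come from some earlier layout or OCR step. The image name and function names are hypothetical.

```python
import json


def build_mapping(hits):
    """Group hits into {char: {image_name: [[index, line_no], ...]}}.

    ASSUMPTION: `hits` is an iterable of (char, image_name, index, line_no)
    tuples; the real mapping may be organised differently.
    """
    mapping = {}
    for char, image_name, index, line_no in hits:
        per_image = mapping.setdefault(char, {})
        per_image.setdefault(image_name, []).append([index, line_no])
    return mapping


def line_crop_box(line_boxes, line_no, image_size, pad=4):
    """Return a (left, upper, right, lower) box around one text line.

    ASSUMPTION: `line_boxes` holds one (left, top, width, height) tuple
    per text line, ordered top to bottom. The box is padded by `pad`
    pixels and clamped to the image bounds.
    """
    left, top, width, height = line_boxes[line_no]
    img_w, img_h = image_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(img_w, left + width + pad),
        min(img_h, top + height + pad),
    )


if __name__ == "__main__":
    hits = [("\u0f61\u0f74", "page_0001.jpg", 12, 0)]
    # ensure_ascii=False keeps the Tibetan characters readable in the file.
    print(json.dumps(build_mapping(hits), ensure_ascii=False, indent=2))
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it can be passed to it directly if Pillow is used for the cropping.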

10kalden commented 4 months ago

Original image

10kalden commented 4 months ago

this is a cropped image for the character ཡུ

10kalden commented 4 months ago

we can also have the cropped image in this size for the character ཡུ

10kalden commented 4 months ago

updated mapping for the Tibetan char

OpenPecha / extract-missing-glyphs

OCR0031: Extracting glyphs by indexing Tibetan characters using OPF data #3