OpenPecha / extract-missing-glyphs

MIT License

OCR0031: Extracting glyphs by indexing Tibetan characters using OPF data #3

Open 10kalden opened 4 months ago

10kalden commented 4 months ago

Description:

Subtask:

Completion Criteria: To obtain cropped glyphs of the required Tibetan characters.

10kalden commented 4 months ago

Facing an issue with the logic behind the indices; I need to explore it further and figure out what is going wrong.

10kalden commented 4 months ago

Running some tests on how the indices of a character are found, as they are not matching the desired images.

10kalden commented 4 months ago

Image

Gathered all the required image names in a JSON file and modified the script to download only these images from BDRC.

10kalden commented 4 months ago

The Tesseract output is not catching all the characters in the images, which makes it difficult to crop glyphs based on indices. The solution is to determine which line each character occurs on and crop that line accordingly.

Image

This is the output of Tesseract OCR. I am only getting around 400 characters per image, whereas the Google OCR output for the same images has about 1000 characters. Because of this discrepancy, cropping glyphs using indices is not going to work.
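A quick way to quantify this kind of gap is to count Tibetan code points in both OCR outputs for the same image. The sketch below is only an illustration: the function names `count_tibetan` and `coverage_ratio` are hypothetical and not part of the repository, and it assumes both outputs are already available as plain strings.

```python
def count_tibetan(text):
    """Count code points in the Tibetan Unicode block (U+0F00 to U+0FFF)."""
    return sum(1 for ch in text if "\u0f00" <= ch <= "\u0fff")


def coverage_ratio(candidate_text, reference_text):
    """Fraction of the reference's Tibetan code points found in the candidate.

    Hypothetical helper: compares raw counts only, not the actual content.
    Returns 0.0 when the reference contains no Tibetan code points.
    """
    reference_count = count_tibetan(reference_text)
    if reference_count == 0:
        return 0.0
    return count_tibetan(candidate_text) / reference_count
```

With roughly 400 characters from Tesseract against roughly 1000 from Google OCR, the ratio would be about 0.4, which is a reasonable signal that character offsets from one output cannot be reused against the other.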

10kalden commented 4 months ago

Image

- This is a JSON output representing each character and its position in the images. In the reference mapping, each value is a list giving the index and the line in which the character occurs.
- This will be used to crop the entire line out of the images to extract the required glyphs.
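A minimal sketch of how such a mapping could be built from OCR text is shown here. The exact layout of the JSON in the screenshot is not reproduced in this thread, so this assumes each target maps to a list of `[index, line]` pairs, where `index` is the code point offset within the whole page text and `line` is a 0-based line number. The function name `build_reference_mapping` is hypothetical.

```python
import json


def build_reference_mapping(page_text, targets):
    """Map each target string to a list of [index, line] pairs.

    Assumptions (not confirmed by the thread):
    - index is the code point offset within the whole page text
    - line is the 0-based line number
    A target such as "\u0f61\u0f74" spans several code points, so it is
    matched as a substring rather than as a single character.
    """
    mapping = {target: [] for target in targets}
    offset = 0  # offset of the current line's first code point in page_text
    for line_no, line in enumerate(page_text.split("\n")):
        for target in targets:
            start = line.find(target)
            while start != -1:
                mapping[target].append([offset + start, line_no])
                start = line.find(target, start + 1)
        offset += len(line) + 1  # +1 accounts for the newline
    return mapping


if __name__ == "__main__":
    # Two short lines, written with escapes to keep code points explicit.
    page = "\u0f61\u0f74\u0f63\u0f0b\u0f51\u0f74\n\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f74\u0f63"
    result = build_reference_mapping(page, ["\u0f61\u0f74"])
    print(json.dumps(result, ensure_ascii=False))
```

Recording the line number alongside the index means a missed character elsewhere on the page shifts the index but still leaves the line usable for cropping.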

10kalden commented 4 months ago

Original image

Image

10kalden commented 4 months ago

Image

This is a cropped image for the character ཡུ.

10kalden commented 4 months ago

We can also have the cropped image in this size for the character ཡུ:

Image
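The two crop sizes above could come from the same bounding box with different padding. The sketch below assumes a bounding box for the glyph or line is already known as `(left, top, right, bottom)` pixel coordinates; how that box is obtained is not shown in this thread. The function name `padded_crop_box` is hypothetical. The returned tuple uses the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts.

```python
def padded_crop_box(bbox, image_size, pad):
    """Expand a bounding box by pad pixels on every side, clamped to the image.

    bbox: (left, top, right, bottom) in pixels (assumed to be known already)
    image_size: (width, height) of the source image
    pad: number of pixels of context to add around the box
    """
    left, top, right, bottom = bbox
    width, height = image_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )
```

A padding of 0 gives the tight crop, while a larger value keeps some surrounding context, which may help when the glyph is later inspected by eye.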

10kalden commented 4 months ago

Image

Updated mapping for the Tibetan characters.