Open 10kalden opened 4 months ago
Facing an issue with the logic behind the indices; need to explore it and figure it out.
Testing how the indices of a character are found, since the results do not match the desired images.
-Gathered all the required image names in a JSON file and modified the script to download only these images from BDRC.
The Tesseract output does not catch all the characters in the images, so it is not easy to crop images based on indices. The solution is to determine which line each character occurs on and crop that line accordingly.
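A minimal sketch of the line lookup described above, assuming the OCR output is already available as a list of strings, one per text line. The function name `locate_glyphs` and the sample strings are hypothetical and not taken from the actual script.

```python
from collections import defaultdict


def locate_glyphs(ocr_lines, targets):
    """Return {glyph: [(line_number, index_in_line), ...]} for each target found.

    ocr_lines: list of strings, one per OCR text line (assumed input format).
    targets:   glyphs or syllables to look for, e.g. ["ཡུ"].
    """
    positions = defaultdict(list)
    for line_no, text in enumerate(ocr_lines):
        for glyph in targets:
            # str.find works on code points, so a stacked syllable such as
            # "ཡུ" (base letter + vowel sign) is matched as a substring.
            start = text.find(glyph)
            while start != -1:
                positions[glyph].append((line_no, start))
                start = text.find(glyph, start + 1)
    return dict(positions)


if __name__ == "__main__":
    sample_lines = ["ཡུལ་དུ", "བོད་ཡུལ"]  # made-up lines, for illustration only
    print(locate_glyphs(sample_lines, ["ཡུ"]))
```

Because only the line number is needed for cropping, a character missed elsewhere on the page does not shift the result the way a page-wide index would.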
This is the Tesseract OCR output. It contains only around 400 characters per image, while the Google OCR output for the same images has about 1000. Because of this discrepancy, cropping glyphs by character index is not going to work.
-This is a JSON output representing each character and its position in the images. In the reference mapping, each value is a list holding the index and the line on which the character occurs.
-This will be used to crop the entire line out of the images to extract the required glyphs.
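A rough sketch of how the mapping could drive the line crop. It assumes each mapping value is `[index, line]` as described above, and that a per-line bounding box `(left, top, right, bottom)` is available from the OCR step. The names `line_crop_box`, `line_boxes`, and `pad` are hypothetical.

```python
def line_crop_box(mapping, glyph, line_boxes, image_size, pad=4):
    """Return a (left, top, right, bottom) box covering the glyph's whole line.

    mapping:    {glyph: [index, line_number]} (assumed shape of the JSON mapping)
    line_boxes: list of (left, top, right, bottom), one per OCR line (assumed)
    image_size: (width, height) of the source image in pixels
    pad:        extra pixels around the line, clamped to the image bounds
    """
    _index, line_no = mapping[glyph]
    left, top, right, bottom = line_boxes[line_no]
    width, height = image_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


if __name__ == "__main__":
    mapping = {"ཡུ": [12, 1]}  # made-up values, for illustration only
    boxes = [(0, 0, 100, 20), (0, 22, 100, 40)]
    print(line_crop_box(mapping, "ཡུ", boxes, image_size=(100, 40)))
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it could be passed to it directly if Pillow is used for the actual cropping.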
Original image
This is a cropped image for the character ཡུ
We can also have the cropped image in this size for the character ཡུ
Updated mapping for the Tibetan characters
Description:
Subtask:
Completion Criteria: To obtain cropped glyphs of the required Tibetan characters.