OCR0034: Further processing for cropped images

10kalden commented 4 months ago

Description:

After cropping the Pecha images into lines based on the line mapping of each Tibetan character, we need to process the images further to make it easier for the annotators to annotate the glyphs.
We can achieve this by cropping the image width so that it accommodates just the required Tibetan character; this step will be done by the image selection annotators.
The idea is to parse all the OPF text transcriptions, extract the mapping, and line-crop 10 images for each required Tibetan character.
If multiple required characters are on the same line, that line counts as a separate image for each of them.
The widths of these 10 images will then be cropped to accommodate the required glyphs.
If the character is not present in some of the images, the script runs again to get the required images based on the mapping.
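
A minimal sketch of the selection step described above, in Python. It is not the actual script: the function name `collect_line_samples`, the `(line_id, text)` input shape, and the `per_char` parameter are hypothetical and chosen only for illustration. It assumes the transcription has already been split into lines.

```python
def collect_line_samples(lines, targets, per_char=10):
    """Pick up to `per_char` lines for every required character.

    lines   -- iterable of (line_id, text) pairs (hypothetical shape)
    targets -- required characters or syllables, e.g. ["ཡུ"]
    """
    samples = {char: [] for char in targets}
    for line_id, text in lines:
        for char in targets:
            # One line can be counted for several characters.
            if len(samples[char]) < per_char and char in text:
                samples[char].append(line_id)
    return samples


def missing_counts(samples, per_char=10):
    """Characters that still need more lines, e.g. to decide on a rerun."""
    return {c: per_char - len(ids) for c, ids in samples.items()
            if len(ids) < per_char}
```

`missing_counts` reports which characters are still short of the target, which is one way to decide whether the script needs another pass.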

Reference:

img.1: Line cropped image for the character ཡུ

For context, the text will occur on the middle line.

Subtask:

[x] update the mapping format of characters in the images
[x] use the toolkit script to download all the required OPF repos from GitHub
[x] update the script to look for 10 occurrences of each character and download the images from S3
[x] explore ways to improve the Tesseract OCR output (if possible)
[x] upload to GitHub for image width cropping

Completion Criteria:

To have cropped images ready for annotation

ta4tsering commented 4 months ago

Mapping of the cropped lines is done. Next, I will use the toolkit to download the OPF repos and run the script to crop the lines in which the missing glyphs are present.

10kalden commented 4 months ago

Updated the script that creates the S3 download keys, because the mapping for the image group or volume ID differs between OPFs; some OPFs don't have a mapping for the image group ID in the meta.yml.

ta4tsering commented 4 months ago

Which OPF? Please provide the name of the OPF here. There can be more than one OPF for a single work ID, so you can look for another OPF of the same work that has the image_group ID in its meta.yml.

10kalden commented 4 months ago

https://github.com/OpenPecha-Data/P000800 is the OPF which doesn't have the mapping in the meta.yml.

ta4tsering commented 4 months ago

You have to look into the files to explore it. For example, if you look into the pagination.yml you will find the image_group_id in the reference. You need to compare it with the image_group_id from the BDRC website as well. For BDRC v001 you can see that the image_group_id is I1317. The reference of the first page is 13170001, so that means our image_group_id is I1317.
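
As a rough sketch of that check in Python: this assumes the reference is a plain digit string whose last four digits are the page number, which fits the single example above (13170001 gives I1317) but has not been verified for other OPFs. The function name and the `page_digits` parameter are hypothetical.

```python
def image_group_from_reference(reference, page_digits=4):
    """Derive an image group ID from a pagination reference.

    Assumption: the reference is the image group number followed by a
    fixed-width page number, e.g. "13170001" = "1317" + "0001".
    """
    ref = str(reference).strip()
    if not (ref.isascii() and ref.isdigit()) or len(ref) <= page_digits:
        raise ValueError(f"unexpected reference format: {reference!r}")
    # Drop the page number and add the "I" prefix used in the example ID.
    return "I" + ref[:-page_digits]


print(image_group_from_reference("13170001"))
```

The result should still be compared against the image_group_id shown on the BDRC website before it is used for downloads.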

10kalden commented 4 months ago

@ta4tsering OK yeah, that's correct. Thank you for that.

ta4tsering commented 4 months ago

Will download the images and crop the lines, upload the cropped lines to S3, and create the JSONL required for the Prodigy format, all on the EC2 server.
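
A possible shape for the JSONL step, sketched in Python with only the standard library. It assumes one task per cropped line with the image URL under an `image` key and extra details under `meta`; the function names, the record layout, and the example URL are placeholders, not the real bucket or script.

```python
import json


def build_prodigy_tasks(records):
    """records: iterable of (image_url, char, line_id) tuples (hypothetical)."""
    return [
        {"image": url, "meta": {"char": char, "line_id": line_id}}
        for url, char, line_id in records
    ]


def to_jsonl(tasks):
    """Serialize tasks as JSON Lines, keeping Tibetan text readable."""
    return "".join(json.dumps(t, ensure_ascii=False) + "\n" for t in tasks)


if __name__ == "__main__":
    # Placeholder URL, for illustration only.
    demo = [("https://example.com/cropped/line_0001.jpg", "ཡུ", "line_0001")]
    print(to_jsonl(build_prodigy_tasks(demo)), end="")
```

`ensure_ascii=False` keeps the Tibetan characters as-is in the file instead of escaping them, which makes the JSONL easier to inspect by hand.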

kaldan007 commented 4 months ago

Will be working on extracting more data from the Google OCR output. @10kalden please check the release assets of the Google OCRed OPFs.

OpenPecha / extract-missing-glyphs

OCR0034: Further processing for cropped images #5