jim-gyas opened 4 months ago
Will explore the existing script, then look into possible image augmentations and image dimensions, and update the card. @kaldan007 will direct you towards the etext source.
@ta4tsering and @kaldan007 , could you please provide a sample text file to use for generating synthetic data? Also, I'd like to know your preferences for fonts and any specific image requirements.
First do the card and make it robust. Write down the augmentations needed to make the output look like a line image from a real book, pecha, or handwritten page: that means introducing noise into the images, bending the image, or blurring it. Also look into the types of image dimensions that need to be created. Finally, decide how you are going to use the available fonts; for example, we already have a large Uchen dataset, so prioritize the others.
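As a rough sketch of those three augmentations (noise, bend, blur) on a grayscale page held as a NumPy array; this is a minimal illustration under that assumption, not the project's actual implementation:

```python
import numpy as np

def add_noise(page: np.ndarray, sigma: float = 12.0, seed: int = 0) -> np.ndarray:
    """Add Gaussian pixel noise to mimic scanner/print grain."""
    rng = np.random.default_rng(seed)
    noisy = page.astype(np.float32) + rng.normal(0.0, sigma, page.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def blur(page: np.ndarray, k: int = 3) -> np.ndarray:
    """Box blur along both axes, a rough stand-in for an out-of-focus scan."""
    out = page.astype(np.float32)
    for axis in (0, 1):
        acc = np.zeros_like(out)
        for shift in range(-(k // 2), k // 2 + 1):
            acc += np.roll(out, shift, axis=axis)
        out = acc / k
    return out.astype(np.uint8)

def bend(page: np.ndarray, amplitude: int = 4) -> np.ndarray:
    """Shift each column vertically along a sine wave to fake page warp."""
    h, w = page.shape
    out = np.empty_like(page)
    for x in range(w):
        shift = int(amplitude * np.sin(2 * np.pi * x / w))
        out[:, x] = np.roll(page[:, x], shift)
    return out
```

In practice a dedicated augmentation library would offer more realistic variants, but the effects themselves are this simple.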
You can refer to the existing OCR data at the following URL
Sounds good! I'll start by using noise, bends, blur, and varied backgrounds to simulate realistic line images. Gathering sample text files from Norbuketaka, WoodBlocks, and Google Books datasets now.
Kindly study which writing styles to generate synthetic data for, and go through Eric's script to see if any extra augmentation is required.
Eric said: "Uh well, I did that already quite a while ago for dbu-can and all betsug fonts, I think", so prioritize fonts other than Betsuk and Uchen.
Your card also needs more clarity: add the resources (such as the fonts repo) to the card, and create the flow chart.
OK, sure. I'll prioritize fonts other than dbu-can and betsug, clarify the card with resources such as the fonts repo, and create a detailed flow chart.
@eric86y you can comment right here.
@ta4tsering and @kaldan007, @eric86y suggested an alternative approach for generating realistic samples. Instead of focusing on individual lines, we could create entire synthetic pages, apply page-level augmentations to introduce realistic imperfections, and then extract the lines from these augmented pages for use as training samples. This method addresses the issue of synthetic data being too clean and regular.
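A minimal sketch of that page-level idea, assuming grayscale NumPy arrays (all function names here are illustrative, not the project's code): stack line images into one page while recording each line's vertical extent, augment the whole page, then crop the lines back out using the recorded boxes.

```python
import numpy as np

def assemble_page(lines, pad=10):
    """Stack line images into one page, recording each line's y-range
    so the lines can be recovered after page-level augmentation."""
    w = max(l.shape[1] for l in lines)
    rows, boxes, y = [], [], 0
    for l in lines:
        # pad each line to page width on a white background
        padded = np.full((l.shape[0] + pad, w), 255, dtype=np.uint8)
        padded[:l.shape[0], :l.shape[1]] = l
        rows.append(padded)
        boxes.append((y, y + l.shape[0]))
        y += padded.shape[0]
    return np.vstack(rows), boxes

def extract_lines(page, boxes):
    """Crop lines back out using the boxes recorded before augmentation."""
    return [page[y0:y1] for (y0, y1) in boxes]
```

Because the augmentation happens between `assemble_page` and `extract_lines`, every extracted line carries the page-level imperfections, which is exactly what makes the samples less clean and regular.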
Going to pursue Eric's suggestion.
Fixing the augmentation bug. A step for converting page-wise images into line image-text pairs needs to be added to the implementation plan.
Will be augmenting images randomly, so each augmented image gets a different combination of augmentations. Still need to figure out how to extract line images.
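One way to give every image a different augmentation is to sample a random subset of augmentation functions per page; a hypothetical helper along these lines (not the project's actual code):

```python
import random

def random_augment(page, ops, rng=None):
    """Apply a random subset of augmentation functions, in random order,
    so every generated page gets a different combination."""
    rng = rng or random.Random()
    chosen = rng.sample(ops, k=rng.randint(1, len(ops)))
    for op in chosen:
        page = op(page)
    return page
```

Passing a seeded `random.Random` makes a generation run reproducible while still varying the augmentation from page to page.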
Will be solving the issue of missing text by tuning the padding.
Will be adding more augmentations.
@kaldan007 done with the modifications to the card.
Done with the basics of page-level augmentation; next he will write the scripts to extract the line images from the page images.
@jim-gyas please prepare samples of all the page and line images you have created via augmentation and save them in a Google Drive folder. Share the folder link in the OCR channel and tag Eric and Devesh to get their feedback.
@kaldan007, I've updated the card for our synthetic line images. Could you please review the changes and let me know if any further updates or improvements are needed?
@jim-gyas please update the card with the list of augmentations you are planning to implement.
Description:
Develop a Python package to generate synthetic Tibetan text images using specified fonts (excluding Uchen and Betsuk). Users can select their preferred augmentations from a predefined list. The package will generate page images, apply the selected augmentations, and then extract and save line images from the augmented pages. The main goal is to provide a tool for creating diverse, realistic line images to enhance the OCR dataset.
Completion Criteria:
Script Creation:
Functionality: Scripts function as expected, producing synthetic images, applying augmentations, and extracting lines.
Font Resources:
1) Font Repository: https://github.com/OpenPecha/tibetan-fonts
2) Google Drive (given by Eric)
Implementation plan:
Subtask:
[x] Create page_image.py
[x] Create augmentation.py
[x] Create extraction_line.py
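As a hedged illustration of the padding fix mentioned earlier in the thread (the actual function names in extraction_line.py may differ): cropping each line with extra vertical margin keeps bent or shifted glyphs near the line boundary from being cut off, which is the "missing text" symptom.

```python
import numpy as np

def crop_with_padding(page: np.ndarray, y0: int, y1: int, pad: int = 6) -> np.ndarray:
    """Crop rows y0..y1 from a grayscale page with `pad` extra rows on
    each side, clamped to the page, so warped glyphs are not clipped."""
    h = page.shape[0]
    return page[max(0, y0 - pad):min(h, y1 + pad)]
```

Tuning `pad` trades off clipped glyphs against bleeding in fragments of neighbouring lines, so it would likely be chosen per augmentation strength.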