OpenPecha / SynthImage

MIT License
OCR0035: Creating synthetic data #1

Open jim-gyas opened 4 months ago

jim-gyas commented 4 months ago

Description:

Develop a Python package to generate synthetic Tibetan text images using specified fonts (excluding Uchen and Betsuk). Users can select their preferred augmentations from a predefined list. The package will generate page images, apply the selected augmentations, and then extract and save line images from the augmented pages. The main goal is to provide a tool for creating diverse, realistic line images to enhance the OCR dataset.

Completion Criteria:

Script Creation:

Font Resources:

1) Font Repository: https://github.com/OpenPecha/tibetan-fonts
2) Google Drive (provided by Eric)

Implementation plan:

Image

Subtask:

ta4tsering commented 4 months ago

Will explore the existing script, then look into possible image augmentations and image dimensions, and update the card. @kaldan007 will direct you towards the e-text source.

jim-gyas commented 4 months ago

@ta4tsering and @kaldan007 , could you please provide a sample text file to use for generating synthetic data? Also, I'd like to know your preferences for fonts and any specific image requirements.

ta4tsering commented 4 months ago

First do the card and make it robust. Write down the possible augmentations needed to make the output look like a line image from a real book, pecha, or handwritten page. That means introducing noise, bending the image, or blurring it. Also look into which image dimensions need to be created. Finally, decide how you are going to use the available fonts. For example, we already have large Uchen datasets, so the other fonts should be prioritized.

You can refer to the existing OCR data at the following links:

  1. Norbuketaka datasets
  2. WoodBlocks datasets
  3. Google Books datasets

jim-gyas commented 4 months ago

Sounds good! I'll start by using noise, bends, blur, and varied backgrounds to simulate realistic line images. Gathering sample text files from Norbuketaka, WoodBlocks, and Google Books datasets now.

kaldan007 commented 4 months ago

Kindly study which writing styles to generate synthetic data for, go through Eric's script, and see if any extra augmentation is required.

ta4tsering commented 4 months ago

Eric said this: "Uh well, I did that already quite a while ago for dbu-can and all betsug fonts, I think." So prioritize fonts other than Betsuk and Uchen. Your card also needs more clarity: put the resources, such as the fonts repo, in the card, and then create the flow chart.

jim-gyas commented 4 months ago

ok sure, I'll prioritize fonts other than dbu-can and betsug. I'll also clarify the card with resources like the fonts repo and create a detailed flow chart.

ta4tsering commented 4 months ago

@eric86y you can comment right here.

jim-gyas commented 4 months ago

@ta4tsering and @kaldan007, @eric86y suggested an alternative approach for generating realistic samples. Instead of focusing on individual lines, we could create entire synthetic pages, apply page-level augmentations to introduce realistic imperfections, and then extract the lines from these augmented pages for use as training samples. This method addresses the issue of synthetic data being too clean and regular.

kaldan007 commented 4 months ago

Going to pursue Eric's suggestion.

kaldan007 commented 4 months ago

Fixing the augmentation bug. A step that converts each page image into line image and text pairs needs to be included in the implementation plan.
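The pairing step mentioned above could look roughly like the sketch below. It assumes the line crops come back in top-to-bottom order and that the source text has one transcription per rendered line; `pair_lines_with_text` and the file names are hypothetical and are not taken from the SynthImage code.

```python
def pair_lines_with_text(line_images, text_lines):
    """Pair line crops with transcriptions, top to bottom.

    Returns None when the counts disagree, so the caller can discard
    the page instead of saving misaligned training pairs.
    """
    # Blank lines in the source text never produce a line image.
    text_lines = [text for text in text_lines if text.strip()]
    if len(line_images) != len(text_lines):
        return None
    return list(zip(line_images, text_lines))


if __name__ == "__main__":
    crops = ["page1_line1.png", "page1_line2.png"]  # hypothetical file names
    text = ["བཀྲ་ཤིས།", "", "བདེ་ལེགས།"]
    print(pair_lines_with_text(crops, text))
```

Rejecting a page on a count mismatch is a design choice for this sketch: a dropped page costs little in a synthetic pipeline, while a shifted transcription would corrupt every following pair.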

kaldan007 commented 4 months ago

Will be augmenting the images randomly, so that each augmented image gets a different augmentation. Still need to figure out how to extract the line images.
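A minimal sketch of per-page random selection, under stated assumptions: the augmentation names are placeholders based on effects mentioned in this thread, and `pick_augmentations` and `plan_pages` are hypothetical helpers, not functions from the package. Random sampling makes pages likely to differ but does not guarantee that every page gets a unique combination.

```python
import random

# Placeholder names based on effects discussed in this thread.
AUGMENTATIONS = [
    "gaussian_blur", "noise", "bend", "deform",
    "background", "dirty_spot", "torn",
]


def pick_augmentations(rng, max_count=3):
    """Return a random, non-empty, duplicate-free list of augmentation names."""
    count = rng.randint(1, max_count)
    return rng.sample(AUGMENTATIONS, count)


def plan_pages(num_pages, seed=0):
    """Map each page index to its own randomly chosen augmentation list."""
    rng = random.Random(seed)  # seeded so a dataset can be regenerated
    return {page: pick_augmentations(rng) for page in range(num_pages)}


if __name__ == "__main__":
    for page, names in plan_pages(3).items():
        print(page, names)
```

Keeping the plan as data, separate from the image code, also makes it easy to log which augmentations produced a given line image.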

kaldan007 commented 4 months ago

Will be solving the issue of missing text by tuning the padding.
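To illustrate why padding can fix missing text, here is a small sketch that assumes the text is lost when a bend pushes pixels past the page edge. The sine-wave `bend`, the helper names, and the plain nested-list image (255 = white, 0 = ink) are all assumptions chosen to keep the example dependency-free; the actual script may deform pages differently.

```python
import math

WHITE = 255


def blank_rows(count, width):
    """Build `count` all-white rows of the given width."""
    return [[WHITE] * width for _ in range(count)]


def pad_vertical(page, pad):
    """Add `pad` white rows above and below the page."""
    width = len(page[0])
    return blank_rows(pad, width) + [row[:] for row in page] + blank_rows(pad, width)


def bend(page, amplitude, wavelength):
    """Shift each column along a sine wave; pixels pushed off the page are lost."""
    height, width = len(page), len(page[0])
    out = blank_rows(height, width)
    for x in range(width):
        shift = round(amplitude * math.sin(2 * math.pi * x / wavelength))
        for y in range(height):
            new_y = y + shift
            if 0 <= new_y < height:
                out[new_y][x] = page[y][x]
    return out


def count_ink(page):
    """Count non-white pixels."""
    return sum(pixel != WHITE for row in page for pixel in row)


if __name__ == "__main__":
    page = [[0] * 8] + blank_rows(3, 8)  # one ink row at the very top
    print(count_ink(bend(page, 2, 8)))                   # 5: three pixels clipped
    print(count_ink(bend(pad_vertical(page, 2), 2, 8)))  # 8: nothing clipped
```

Padding by at least the bend amplitude guarantees that no column can be shifted off the canvas in this model.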

jim-gyas commented 4 months ago

Applying Gaussian Blur Noise to Straight Synthetic Page

Image

Extracting Line After Augmentation

Image

jim-gyas commented 4 months ago

Bending Synthetic Page

Image

Line Extraction

Image

kaldan007 commented 4 months ago

Will be adding more augmentations.

jim-gyas commented 4 months ago

Padding the bending page image.

Image

Straight Page Augmentation

Image

jim-gyas commented 4 months ago

@kaldan007 done with the modification of card.

jim-gyas commented 4 months ago

Synthetic Page Generation

Image

Augmentation (Deformation)

Image

Augmentation (Background)

Image

ta4tsering commented 4 months ago

Done with the basics of page-level augmentation. Next he will write the scripts to extract the line images from the page images.
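One common way to get line images from a page is a horizontal projection profile: rows containing ink are grouped into bands, and each band becomes one crop. The sketch below shows that technique on a nested-list image (255 = white); it is not necessarily how the SynthImage scripts do it, and `find_line_bands` and `crop_lines` are hypothetical names.

```python
WHITE = 255


def find_line_bands(page, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink."""
    bands, start = [], None
    for y, row in enumerate(page):
        has_ink = sum(pixel != WHITE for pixel in row) >= min_ink
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            bands.append((start, y))  # the line ended on the previous row
            start = None
    if start is not None:  # the last line touches the bottom edge
        bands.append((start, len(page)))
    return bands


def crop_lines(page, margin=0):
    """Cut one sub-image per band, with optional margin rows clamped to the page."""
    height = len(page)
    return [
        page[max(0, top - margin):min(height, bottom + margin)]
        for top, bottom in find_line_bands(page)
    ]


if __name__ == "__main__":
    ink, blank = [0, 0, 255], [255, 255, 255]
    page = [blank, ink, ink, blank, blank, ink, blank]
    print(find_line_bands(page))
```

A plain projection profile works best on straight pages. On strongly bent or deformed pages neighbouring lines can overlap vertically and merge into one band, so an alternative is to record each line's bounding box at render time and carry it through the augmentation.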

jim-gyas commented 4 months ago

Augmented Page (Deform)

Image

Extracted Line from Augmented Page

Image

Image

Augmented Page (Bend)

Image

Extracted Line

Image

jim-gyas commented 4 months ago

Dirty Spot Augmented page

Image

Image

Extracted Line

Image

Torn Augmented page

Image

kaldan007 commented 4 months ago

@jim-gyas please prepare samples of all the page and line images that you have created via augmentations and save them in a Google Drive folder. Share the folder link in the OCR channel and tag Eric and Devesh to get their feedback.

jim-gyas commented 3 months ago

Augmented Page (Torn)

Image

Extracted Line

Image

Image

Augmented Page (Tear)

Image

Extracted Line

Image

Image

jim-gyas commented 3 months ago

@kaldan007, I’ve updated the card related to our synthetic line image. Could you please review the changes and let me know if there are any further updates or improvements needed?

kaldan007 commented 3 months ago

@jim-gyas please update the card with the list of augmentations you are planning to implement.