jim-gyas opened 4 months ago
Will explore the existing script, then look into possible image augmentations and image dimensions, and update the card. @kaldan007 will direct you towards the etext source.
@ta4tsering and @kaldan007 , could you please provide a sample text file to use for generating synthetic data? Also, I'd like to know your preferences for fonts and any specific image requirements.
First do the card and make it robust. Write down the augmentations needed to make the output look like a line image from a real book, pecha, or handwritten page: that means introducing noise into the images, bending the image, or blurring it. Also look into the types of image dimensions that need to be created. Finally, decide how you are going to use the available fonts; for example, we already have a large Uchen dataset, so prioritize the others.
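As a rough sketch of those three augmentations (noise, bend, blur) on a grayscale page held as a NumPy array; this is a minimal illustration under that assumption, not the project's actual implementation:

```python
import numpy as np

def add_noise(page: np.ndarray, sigma: float = 12.0, seed: int = 0) -> np.ndarray:
    """Add Gaussian pixel noise to mimic scanner/print grain."""
    rng = np.random.default_rng(seed)
    noisy = page.astype(np.float32) + rng.normal(0.0, sigma, page.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def blur(page: np.ndarray, k: int = 3) -> np.ndarray:
    """Box blur along both axes, a rough stand-in for an out-of-focus scan."""
    out = page.astype(np.float32)
    for axis in (0, 1):
        acc = np.zeros_like(out)
        for shift in range(-(k // 2), k // 2 + 1):
            acc += np.roll(out, shift, axis=axis)
        out = acc / k
    return out.astype(np.uint8)

def bend(page: np.ndarray, amplitude: int = 4) -> np.ndarray:
    """Shift each column vertically along a sine wave to fake page warp."""
    h, w = page.shape
    out = np.empty_like(page)
    for x in range(w):
        shift = int(amplitude * np.sin(2 * np.pi * x / w))
        out[:, x] = np.roll(page[:, x], shift)
    return out
```

In practice a dedicated augmentation library would offer more realistic variants, but the effects themselves are this simple.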
You can refer to the existing OCR data at the following URL
Sounds good! I'll start by using noise, bends, blur, and varied backgrounds to simulate realistic line images. Gathering sample text files from Norbuketaka, WoodBlocks, and Google Books datasets now.
Kindly study which writing styles to generate synthetic data for, and go through Eric's script to see if any extra augmentation is required.
Eric said: "Uh well, I did that already quite a while ago for dbu-can and all betsug fonts, I think", so prioritize fonts other than Betsuk and Uchen.
Your card also needs more clarity: add the resources (such as the fonts repo) to the card, and create the flow chart.
OK, sure. I'll prioritize fonts other than dbu-can and betsug, clarify the card with resources such as the fonts repo, and create a detailed flow chart.
@eric86y you can comment right here.
@ta4tsering and @kaldan007, @eric86y suggested an alternative approach for generating realistic samples. Instead of focusing on individual lines, we could create entire synthetic pages, apply page-level augmentations to introduce realistic imperfections, and then extract the lines from these augmented pages for use as training samples. This method addresses the issue of synthetic data being too clean and regular.
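A minimal sketch of that page-level idea, assuming grayscale NumPy arrays (all function names here are illustrative, not the project's code): stack line images into one page while recording each line's vertical extent, augment the whole page, then crop the lines back out using the recorded boxes.

```python
import numpy as np

def assemble_page(lines, pad=10):
    """Stack line images into one page, recording each line's y-range
    so the lines can be recovered after page-level augmentation."""
    w = max(l.shape[1] for l in lines)
    rows, boxes, y = [], [], 0
    for l in lines:
        # pad each line to page width on a white background
        padded = np.full((l.shape[0] + pad, w), 255, dtype=np.uint8)
        padded[:l.shape[0], :l.shape[1]] = l
        rows.append(padded)
        boxes.append((y, y + l.shape[0]))
        y += padded.shape[0]
    return np.vstack(rows), boxes

def extract_lines(page, boxes):
    """Crop lines back out using the boxes recorded before augmentation."""
    return [page[y0:y1] for (y0, y1) in boxes]
```

Because the augmentation happens between `assemble_page` and `extract_lines`, every extracted line carries the page-level imperfections, which is exactly what makes the samples less clean and regular.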
Going to pursue Eric's suggestion.
Fixing the augmentation bug. A step for converting page-wise images into line image-text pairs needs to be added to the implementation plan.
Will be augmenting images randomly, so each augmented image gets a different combination of augmentations. Still need to figure out how to extract line images.
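One way to give every image a different augmentation is to sample a random subset of augmentation functions per page; a hypothetical helper along these lines (not the project's actual code):

```python
import random

def random_augment(page, ops, rng=None):
    """Apply a random subset of augmentation functions, in random order,
    so every generated page gets a different combination."""
    rng = rng or random.Random()
    chosen = rng.sample(ops, k=rng.randint(1, len(ops)))
    for op in chosen:
        page = op(page)
    return page
```

Passing a seeded `random.Random` makes a generation run reproducible while still varying the augmentation from page to page.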
Will be solving the issue of missing text by tuning the padding.
Will be adding more augmentations.
@kaldan007 done with the modifications to the card.
Done with the basics of page-level augmentation; next he will write the scripts to extract the line images from the page images.
@jim-gyas please prepare samples of all the page and line images you have created via augmentation and save them in a Google Drive folder. Share the folder link in the OCR channel and tag Eric and Devesh to get their feedback.
@kaldan007, I've updated the card for our synthetic line images. Could you please review the changes and let me know if any further updates or improvements are needed?
@jim-gyas please update the card with the list of augmentations you are planning to implement.
Description:
Develop a Python package to generate synthetic Tibetan text images using specified fonts (excluding Uchen and Betsuk). Users can select their preferred augmentations from a predefined list. The package will generate page images, apply the selected augmentations, and then extract and save line images from the augmented pages. The main goal is to provide a tool for creating diverse, realistic line images to enhance the OCR dataset.
Completion Criteria:
Script Creation:
Functionality: Scripts function as expected, producing synthetic images, applying augmentations, and extracting lines.
Font Resources:
1) Font Repository: https://github.com/OpenPecha/tibetan-fonts
2) Google Drive (given by Eric)
Implementation plan:
Subtask:
[x] Create page_image.py
[x] Create augmentation.py
[x] Create extraction_line.py
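As a hedged illustration of the padding fix mentioned earlier in the thread (the actual function names in extraction_line.py may differ): cropping each line with extra vertical margin keeps bent or shifted glyphs near the line boundary from being cut off, which is the "missing text" symptom.

```python
import numpy as np

def crop_with_padding(page: np.ndarray, y0: int, y1: int, pad: int = 6) -> np.ndarray:
    """Crop rows y0..y1 from a grayscale page with `pad` extra rows on
    each side, clamped to the page, so warped glyphs are not clipped."""
    h = page.shape[0]
    return page[max(0, y0 - pad):min(h, y1 + pad)]
```

Tuning `pad` trades off clipped glyphs against bleeding in fragments of neighbouring lines, so it would likely be chosen per augmentation strength.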