OpenPecha / SynthImage


OCR0049: Generating Synthetic Training Data #19

Open jim-gyas opened 2 months ago

jim-gyas commented 2 months ago

Description:

This project updates the package to support generating synthetic OCR training data for Tibetan script in both the Tibetan Pecha and Modern Tibetan Book formats. Enhancements include updates to the line extraction script and support for varied fonts, dimensions, and random font sizes. For Tibetan Pecha pages, 40% have a background and 60% do not. Pages undergo random augmentations, and lines are extracted from both augmented and non-augmented pages. The primary goal is to create a diverse synthetic dataset for training OCR models to recognize Tibetan script more accurately.

Completion Criteria:

1) Pecha Format:

Pecha Format Dimensions:
    dimensions = [
        (1123, 265),
        (794, 265),
        (1680, 402),
        (1000, 128),
        (1800, 630),
        (2864, 680),
    ]
Pecha Format Font Sizes:
1) 10 ---> (0.3 probability)
2) 11 ---> (0.5 probability)
3) 12 ---> (0.2 probability)
Pecha Format Font Colors:
1) Black ---> (0.7 probability)
2) Red ---> (0.15 probability)
3) Faded Red ---> (0.15 probability)
List of augmentations to be applied:
Total number of synthetic training samples generated for the Pecha format (Durtsa short writing style): 43,470 samples
1) Synthetic pages with augmentation:
- Durtsa fonts: 5 fonts * 4,347 samples = 21,735 samples
2) Synthetic pages without augmentation:
- Durtsa fonts: 5 fonts * 4,347 samples = 21,735 samples
(A weighted-sampling sketch for the dimensions, font sizes, and colors above follows at the end of this section.)
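As a rough illustration of how the Pecha parameters above could be sampled per page, here is a minimal sketch. The constant and function names are my own, dimensions are drawn uniformly because no weights are listed for them, and the 40%/60% background split comes from the description; this is not the package's actual code.

```python
import random

PECHA_DIMENSIONS = [
    (1123, 265), (794, 265), (1680, 402),
    (1000, 128), (1800, 630), (2864, 680),
]
PECHA_FONT_SIZES = [10, 11, 12]
PECHA_FONT_SIZE_WEIGHTS = [0.3, 0.5, 0.2]
PECHA_FONT_COLORS = ["black", "red", "faded_red"]
PECHA_FONT_COLOR_WEIGHTS = [0.7, 0.15, 0.15]

def sample_pecha_page_config() -> dict:
    """Pick one page configuration using the probabilities listed above."""
    return {
        # no weights are given for dimensions, so sample them uniformly
        "dimension": random.choice(PECHA_DIMENSIONS),
        "font_size": random.choices(PECHA_FONT_SIZES, weights=PECHA_FONT_SIZE_WEIGHTS, k=1)[0],
        "font_color": random.choices(PECHA_FONT_COLORS, weights=PECHA_FONT_COLOR_WEIGHTS, k=1)[0],
        # 40% of Pecha pages get a real background, 60% stay plain
        "use_background": random.random() < 0.4,
    }
```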

2) Modern Book Format:

Modern Book Format Dimensions:
    dimensions = [
        (626, 771),
        (1063, 1536),
        (259, 194),
        (349, 522),
        (974, 1500),
        (968, 1440),
    ]
Modern Book Format Font Sizes:
1) 12 ---> (0.5 probability)
2) 14 ---> (0.2 probability)
3) 16 ---> (0.15 probability)
4) 30 ---> (0.15 probability)
Total number of synthetic training samples generated for the Modern Book format (Durtsa short writing style): 43,470 samples
1) Synthetic pages with augmentation:
- Durtsa fonts: 5 fonts * 4,347 samples = 21,735 samples
2) Synthetic pages without augmentation:
- Durtsa fonts: 5 fonts * 4,347 samples = 21,735 samples
(A quick arithmetic check of these totals follows below.)
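To make the totals above explicit, a quick check in plain Python; the per-font figure of 4,347 is taken from this card, and the variable names are mine:

```python
fonts = 5                # Durtsa fonts per format
samples_per_font = 4347  # samples per font, per augmentation setting (from this card)

with_aug = fonts * samples_per_font      # 21,735
without_aug = fonts * samples_per_font   # 21,735
per_format_total = with_aug + without_aug

print(with_aug, without_aug, per_format_total)  # 21735 21735 43470
```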

Resources:

1) Generating Synthetic Training Data Package: https://github.com/OpenPecha/SynthImage
2) Text file: https://github.com/OpenPecha/synthetic-data-for-ocr/tree/main/texts/kangyur
3) Fonts repo:

4) Background repo: https://github.com/OpenPecha/synthetic-data-for-ocr/tree/main/backgrounds

Implementation:

Screenshot 2024-08-28 at 11 21 39 AM

Subtask

jim-gyas commented 2 months ago

Hi @kaldan007, could you please review my card? @ta4tsering has assigned me the task of generating synthetic training data using our existing package.

jim-gyas commented 2 months ago

Top and bottom padding = 40, left and right padding = 80, font size = 30, number of pages = 621.
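For reference, a minimal sketch of rendering one synthetic page with the padding and font size above. This is not the actual SynthImage renderer; the page size, font path, and sample text are placeholders.

```python
from PIL import Image, ImageDraw, ImageFont

PAGE_SIZE = (1123, 265)          # assumed Pecha dimension, for illustration only
TOP_BOTTOM_PAD, LEFT_RIGHT_PAD = 40, 80
FONT_SIZE = 30

def render_page(lines, font_path="fonts/Qomolangma-Drutsa.ttf"):
    """Draw a list of text lines onto a white page, honoring the paddings above."""
    page = Image.new("RGB", PAGE_SIZE, "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.truetype(font_path, FONT_SIZE)
    y = TOP_BOTTOM_PAD
    for line in lines:
        draw.text((LEFT_RIGHT_PAD, y), line, font=font, fill="black")
        # advance to just below the rendered line (bbox bottom)
        bbox = draw.textbbox((LEFT_RIGHT_PAD, y), line, font=font)
        y = bbox[3]
    return page

# hypothetical usage:
# page = render_page(["བཀྲ་ཤིས་བདེ་ལེགས།"] * 4)
```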

Durtsa Font Synthetic Page Image:

1) Durtsa_GologDernangSadris-Drutsa

Image

2) Durtsa_Kangba Derchi-Drutsa

Image

3) Durtsa_monlam_uni_dutsa1

Image

4) Durtsa_Qomolangma-Drutsa

Image

Quikyig Font Synthetic Page Image:

1) Quikyig_monlam_uni_chouk

Image

2) Quikyig_Qomolangma-Chuyig

Image

Tsugring Font Synthetic Page Image:

1) Tsugring_monlam_uni_tikrang

Image

2) Tsugring_Qomolangma-Tsuring

Image

Tsugthung Font Synthetic Page Image:

1) Tsugthung_monlam_uni_tiktong

Image

2) Tsugthung_Qomolangma-Tsutong

Image

ta4tsering commented 2 months ago

Write down the amount of data you are going to create: how many line images for each font and each augmentation. List all the numbers for each line image group you will create, and include a line image example of each group, so we can visualize the outputs before the code is run. Also, why are all the page examples above in the Pecha format? Why isn't there any modern page format in there?

jim-gyas commented 2 months ago

Write down the amount of data you are going to create: how many line images for each font and each augmentation. List all the numbers for each line image group you will create, and include a line image example of each group, so we can visualize the outputs before the code is run. Also, why are all the page examples above in the Pecha format? Why isn't there any modern page format in there?

Sure @ta4tsering, I'll take care of the tasks mentioned, including the creation of line images with different fonts, augmentations, and writing styles. I'll also make sure to include modern page formats along with Pecha. This will cover print types like woodblock print, handwritten, and modern print, as well as various writing styles.

jim-gyas commented 2 months ago

In the modern format, some of the lines exceed the page width:

Screenshot 2024-08-14 at 6 10 23 AM

Some other writing styles added:

Screenshot 2024-08-14 at 11 59 03 AM Screenshot 2024-08-14 at 11 59 56 AM Screenshot 2024-08-14 at 12 05 37 PM
ta4tsering commented 2 months ago

One example of a paper is here: https://paperswithcode.com/paper/utrnet-high-resolution-urdu-text-recognition, where they have both real data and synthetic data. You will find lots more papers on OCR at https://paperswithcode.com.

jim-gyas commented 2 months ago

One example of a paper is here: https://paperswithcode.com/paper/utrnet-high-resolution-urdu-text-recognition, where they have both real data and synthetic data. You will find lots more papers on OCR at https://paperswithcode.com.

Ok, sure @ta4tsering, I will look into that.

jim-gyas commented 2 months ago

Hi @eric and @devatwiai, could you please review my card regarding the generation of synthetic training data? I would greatly appreciate any suggestions you might have on improvements or corrections. Let me know if there are any issues or if everything looks good as is.

jim-gyas commented 2 months ago

Pecha Format Augmentation Example :

Screenshot 2024-08-19 at 10 06 37 AM Screenshot 2024-08-19 at 10 04 56 AM Screenshot 2024-08-19 at 10 00 06 AM Screenshot 2024-08-19 at 9 57 53 AM Screenshot 2024-08-19 at 9 56 13 AM Screenshot 2024-08-19 at 10 02 32 AM

jim-gyas commented 2 months ago

For dimension (1123, 265) with font size 12 and chars_per_line = 260:

Screenshot 2024-08-21 at 12 01 28 PM Screenshot 2024-08-21 at 12 01 17 PM

For dimension (794, 265) with font size 12 and chars_per_line = 170:

Screenshot 2024-08-21 at 11 57 55 AM Screenshot 2024-08-21 at 11 57 42 AM
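For reference, a minimal sketch of enforcing chars_per_line before rendering. It assumes simple whitespace-based wrapping with Python's textwrap, which is not necessarily what the package does; the sample text is a placeholder.

```python
import textwrap

# Placeholder text; the real corpus comes from the Kangyur text files linked above.
tibetan_text = "བཀྲ་ཤིས་བདེ་ལེགས། " * 200

def wrap_text(text: str, chars_per_line: int) -> list:
    """Split text into lines of at most `chars_per_line` characters.

    Note: textwrap breaks on whitespace, so Tibetan text that only uses the
    tsheg (་) between syllables may need a custom splitter. It also counts
    characters rather than rendered width, so wide fonts can still overflow
    narrow pages (the overflow issue mentioned earlier for the modern format).
    """
    return textwrap.wrap(text, width=chars_per_line)

lines_wide = wrap_text(tibetan_text, chars_per_line=260)    # for dimension (1123, 265)
lines_narrow = wrap_text(tibetan_text, chars_per_line=170)  # for dimension (794, 265)
```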
devatwiai commented 2 months ago

Hi @jim-gyas,

I have a couple of questions and suggestions:

  1. Have you tried using real-world Pecha backgrounds like the ones available here: Backgrounds?
  2. What is your process for extracting individual lines from the generated images? Do you have an option for generating segmentation masks for lines or glyph masks?
  3. Are you only generating images with black font? You might want to experiment with other color maps, such as red, which is sometimes present in Pechas. Faded red, in particular, can be challenging for OCR models.
  4. Overall, the results look good to me. However, I’ve noticed in some modern Tibetan images, ink from the following page is often visible on the current page. You might want to explore augmentations related to this or any other useful augmentations from the following:
jim-gyas commented 2 months ago

Kangba Derchi-Drutsa Font :

Font size (12) and Dimension (794*265):

Screenshot 2024-08-22 at 10 46 10 AM

Font size (11) and Dimension (1123*265):

Screenshot 2024-08-22 at 10 36 54 AM

Font size (11) and Dimension (794*397):

Screenshot 2024-08-22 at 10 37 06 AM

Font size (10) and Dimension (1123*397):

Screenshot 2024-08-22 at 10 37 50 AM
jim-gyas commented 2 months ago

Hi @jim-gyas,

I have a couple of questions and suggestions:

  1. Have you tried using real-world Pecha backgrounds like the ones available here: Backgrounds?
  2. What is your process for extracting individual lines from the generated images? Do you have an option for generating segmentation masks for lines or glyph masks?
  3. Are you only generating images with black font? You might want to experiment with other color maps, such as red, which is sometimes present in Pechas. Faded red, in particular, can be challenging for OCR models.
  4. Overall, the results look good to me. However, I’ve noticed in some modern Tibetan images, ink from the following page is often visible on the current page. You might want to explore augmentations related to this or any other useful augmentations from the following:

Hi @devatwiai ,

Thank you for your thoughtful review and suggestions. I appreciate the time you've taken to provide such valuable feedback.

  1. Pecha Backgrounds: I haven't yet incorporated the Pecha backgrounds from the repository you provided, but I agree that using real-world Pecha backgrounds is a great idea. I'll integrate them into the next phase of development to enhance the authenticity of the generated data.

  2. Line Extraction: Regarding line extraction, I’ve implemented a detailed process to accurately extract individual lines from the generated images. The method involves rotating the augmented image to align the text lines horizontally, creating a blank reference image to determine bounding boxes for each line, and then cropping the lines based on these bounding boxes. We ensure that only non-blank lines are extracted, and if the image was rotated, we rotate the extracted lines back to their original orientation. This approach ensures precise and clean extraction of text lines, making them ready for further analysis or processing. You can find an example of the line extraction process here: Extracted Lines. (A rough sketch of this approach appears after this list.)

  3. Font Color: Currently, I'm generating images using black font exclusively. However, you've raised an excellent point about incorporating other color maps, particularly red and faded red, which are common in Pechas and can be challenging for OCR models. I’ll definitely explore this further to improve the diversity and robustness of the training data.

  4. Additional Augmentations: Your suggestion about augmentations, especially regarding visible ink from following pages and the augmentations listed under Image Effects, Transforms, and Text Effects, is greatly appreciated. I’m keen on exploring these options to further refine and expand the augmentation techniques we’re currently using.

Overall, I’m grateful for your insightful feedback, and I'm excited to incorporate these suggestions to improve the quality of our synthetic data.
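For illustration, here is a rough sketch of the rotation-based extraction described in point 2 above. The horizontal-projection heuristic and all names are my own assumptions (the repository's version uses a blank reference image to find the bounding boxes, and rotating the crops back is omitted here), so treat this as a sketch, not the actual implementation.

```python
import cv2
import numpy as np

def extract_lines_by_projection(page_bgr, angle_deg=0.0, min_line_height=10):
    """Rotate the page so lines are horizontal, then crop each text band."""
    h, w = page_bgr.shape[:2]
    # 1) rotate the page so the text lines are horizontal
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    rotated = cv2.warpAffine(page_bgr, M, (w, h), borderValue=(255, 255, 255))
    # 2) binarize: text becomes white (255) on a black background
    gray = cv2.cvtColor(rotated, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # 3) rows containing ink form contiguous bands; each band is one text line
    row_has_ink = binary.sum(axis=1) > 0
    lines, start = [], None
    for y, has_ink in enumerate(row_has_ink):
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            if y - start >= min_line_height:   # skip blank/noise bands
                lines.append(rotated[start:y, :])
            start = None
    if start is not None:
        lines.append(rotated[start:, :])
    return lines
```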

jim-gyas commented 2 months ago

Modern Book Format:

Kangba Derchi-Drutsa Font:

Font Size 12 and Dimension (900*1350)

Screenshot 2024-08-23 at 11 54 47 AM

Font Size 14 and Dimension (750*1200)

Screenshot 2024-08-23 at 11 57 42 AM

Font Size 30 and Dimension (1050*1500)

Screenshot 2024-08-23 at 11 58 33 AM

Font Size 14 and Dimension (1275*1650)

Screenshot 2024-08-23 at 12 00 31 PM
jim-gyas commented 2 months ago

Pecha Format:

Kangba Derchi-Drutsa Font:

Font Size:12 and Dimension (1123,265):

Screenshot 2024-08-23 at 12 03 01 PM

Font Size:12 and Dimension (1123,397):

Screenshot 2024-08-23 at 12 04 02 PM

Font Size:12 and Dimension (794,397):

Screenshot 2024-08-23 at 12 05 14 PM

Font Size:12 and Dimension (794,265):

Screenshot 2024-08-23 at 12 06 11 PM
jim-gyas commented 2 months ago

Pecha Format:

Kangba Derchi-Drutsa Font:

1) Dimension 1123*265 and font size 11 with black font color:

Screenshot 2024-08-26 at 11 43 54 AM

2) Dimension 794*265 and font size 11 with red font color:

Screenshot 2024-08-26 at 11 46 20 AM

3) Dimension 1680*402 and font size 11 with black font color:

Screenshot 2024-08-26 at 11 49 35 AM

4) Dimension 2864*680 and font size 11 with black font color:

page_101_2864x680_count_13_font11_Kangba Derchi-Drutsa

jim-gyas commented 2 months ago

Modern Book Format:

玉翅+zhuca+weizang Font:

1) Dimension 626*771 and Font Size 16:

Screenshot 2024-08-26 at 12 00 21 PM

2) Dimension 349*522 and Font Size 12:

Screenshot 2024-08-26 at 12 01 48 PM

3) Dimension 968*1440 and font size 16:

Screenshot 2024-08-26 at 12 03 07 PM
jim-gyas commented 2 months ago

Modern Book Extracted Line Dataset

1) Kangba Derchi-Drutsa(Drutsa Short)

Synthetic Page Image (without augmentation) (Dimension (626*771) and font size 14):

626x771_modern_format_page_image

Extracted Lines:

line_0

line_1

line_2

line_3

line_4

2) monlam_uni_tiktong (Tsugthung) (Dimension (626*771) and font size 14):

Synthetic Page Image (without augmentation):

626x771_modern_format_page_image

Extracted Lines:

line_0

line_1

line_2

line_3

line_4

jim-gyas commented 2 months ago

Pecha Format Extracted Line Dataset:

1) Kangba Derchi-Drutsa(Drutsa Short):

Synthetic Page Image (without augmentation) (Dimension (794*265) and font size 10):

1123x265_pecha_format_page_image

Extracted Lines:

line_0

line_1

line_2

line_3

line_4

2) monlam_uni_tiktong (Tsugthung):

Synthetic Page Image (without augmentation) (Dimension (1123*265) and font size 10):

1123x265_pecha_format_page_image

Extracted Lines:

line_0

line_1

line_2

line_3

line_4

jim-gyas commented 2 months ago

Hello @devatwiai, I wanted to clarify that I previously referenced the wrong repository. The examples of extracted lines for both the modern book synthetic dataset and the Pecha synthetic dataset are provided in the two comments above. The line extraction process for these datasets differs from the previous method.

The line extraction process involves the following steps:

  1. Convert to Grayscale: The page image is converted to grayscale.
  2. Apply Threshold: A threshold is applied to make the text white and the background black.
  3. Find Contours: The contours of the text lines are detected and sorted from top to bottom.
  4. Create Bounding Boxes: For each line, a bounding box is created, and an image region is extracted with additional space above and below the line to achieve a more realistic appearance.
  5. Resize and Collect: The extracted line image is resized to match the page width and added to a list.

This process is repeated for all lines, resulting in a collection of line images extracted from the page.
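A minimal sketch of these five steps with OpenCV follows. The dilation kernel (used to merge glyphs of a line into one contour), the padding value, and the function name are my assumptions rather than the package's actual code.

```python
import cv2

def extract_line_images(page_bgr, extra_space=10):
    """Contour-based line extraction following the steps described above."""
    # 1) convert to grayscale
    gray = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2GRAY)
    # 2) threshold so the text is white and the background black
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # widen strokes horizontally so each text line merges into a single contour
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50, 5))
    merged = cv2.dilate(thresh, kernel, iterations=1)
    # 3) find contours and sort them from top to bottom
    contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key=lambda c: cv2.boundingRect(c)[1])
    page_h, page_w = page_bgr.shape[:2]
    line_images = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        # 4) crop with extra space above and below for a more realistic appearance
        top = max(y - extra_space, 0)
        bottom = min(y + h + extra_space, page_h)
        line = page_bgr[top:bottom, x:x + w]
        # 5) resize to the page width and collect
        line = cv2.resize(line, (page_w, bottom - top))
        line_images.append(line)
    return line_images
```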

jim-gyas commented 2 months ago

Pecha Format With Background:

玉翅+zhuca+weizang Font without Augmentation:

page_50_1800x630_count_9_font10_玉翅+zhuca+weizang

page_3_1123x265_count_2_font10_玉翅+zhuca+weizang

玉翅+zhuca+weizang Font with Augmentation:

page_50_1800x630_count_9_font10_玉翅+zhuca+weizang_ScribbleAugmentation_BrightnessAugmentation(factor=1.07)

page_3_1123x265_count_2_font10_玉翅+zhuca+weizang_TornAugmentation(num_tears=5, tear_size=48, jagged_step=8, jagged_variability=5)
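The suffixes in the filenames above encode the augmentations applied and their parameters, e.g. BrightnessAugmentation(factor=1.07) and TornAugmentation(num_tears=5, tear_size=48, jagged_step=8, jagged_variability=5). As a rough illustration only (the class name and interface here are assumptions, not SynthImage's actual API), a brightness augmentation might look like:

```python
from PIL import Image, ImageEnhance

class BrightnessAugmentation:
    """Illustrative stand-in for the augmentation named in the filenames above."""

    def __init__(self, factor: float = 1.07):
        self.factor = factor  # >1 brightens the page, <1 darkens it

    def apply(self, page: Image.Image) -> Image.Image:
        return ImageEnhance.Brightness(page).enhance(self.factor)

# hypothetical usage on a generated page image (placeholder filename)
page = Image.open("page_3_1123x265.png")
augmented = BrightnessAugmentation(factor=1.07).apply(page)
```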

jim-gyas commented 2 months ago

Pecha Format with line spacing: 5

1123x265_pecha_format_page_image

Line Extraction with line spacing: 5

line_0

line_1

line_2

Pecha Format with line spacing: 8

1123x265_pecha_format_page_image

Line Extraction with line spacing: 8

line_0

line_1

line_2

@ta4tsering, I’ve attached several variations of line spacing as you suggested in previous comments. In the earlier versions, the gap between lines seemed quite large and unrealistic. From my perspective, a line spacing of 8 appears to be more realistic. Could you please review the attached samples and let me know which line spacing you think looks best?
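For clarity, line spacing here is the extra vertical gap, in pixels, inserted between consecutive lines when the page is rendered. A minimal sketch (placeholder font path and page size; not the project's actual renderer):

```python
from PIL import Image, ImageDraw, ImageFont

def render_with_spacing(lines, line_spacing=8, font_path="fonts/monlam_uni_dutsa1.ttf"):
    """Draw lines with a fixed extra gap (in pixels) between them."""
    page = Image.new("RGB", (1123, 265), "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.truetype(font_path, 11)
    y = 40
    for line in lines:
        draw.text((80, y), line, font=font, fill="black")
        bbox = draw.textbbox((80, y), line, font=font)
        y = bbox[3] + line_spacing   # next line starts `line_spacing` px lower
    return page
```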

devatwiai commented 2 months ago

Hello @devatwiai , I wanted to clarify that I previously referenced the wrong repository. The examples of extracted lines for both the modern book synthetic dataset and the Pecha synthetic dataset are provided in the above 2 comments. The line extraction process for these datasets differs from the previous method.

The line extraction process involves the following steps:

  1. Convert to Grayscale: The page image is converted to grayscale.
  2. Apply Threshold: A threshold is applied to make the text white and the background black.
  3. Find Contours: The contours of the text lines are detected and sorted from top to bottom.
  4. Create Bounding Boxes: For each line, a bounding box is created, and an image region is extracted with additional space above and below the line to achieve a more realistic appearance.
  5. Resize and Collect: The extracted line image is resized to match the page width and added to a list.

This process is repeated for all lines, resulting in a collection of line images extracted from the page.

@jim-gyas Line extraction looks fine to me on these images for the purpose of OCR training. Since we are already specifying line spacing while generating the whole page, it would be good to use that information here as well to decide the space above and below each line before extracting.

jim-gyas commented 2 months ago

@jim-gyas Line extraction looks fine to me on these images for the purpose of OCR training. Since we are already specifying line spacing while generating the whole page, it would be good to use that information here as well to decide the space above and below each line before extracting.

@devatwiai Okay, I will look into that approach to extract lines. Thanks for the suggestion!
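A minimal sketch of what that suggestion could look like: reuse the line_spacing value from page generation as the margin above and below each detected line before cropping. The names here are mine, not the package's.

```python
def crop_line_with_spacing(page, line_box, line_spacing):
    """Crop one line, padding it by the spacing used when the page was rendered.

    `page` is an image array of shape (H, W, 3); `line_box` is (x, y, w, h)
    from the bounding-box step; `line_spacing` is the value used at
    page-generation time.
    """
    page_h = page.shape[0]
    x, y, w, h = line_box
    top = max(y - line_spacing, 0)
    bottom = min(y + h + line_spacing, page_h)
    return page[top:bottom, x:x + w]
```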

devatwiai commented 2 months ago

Pecha Format With Background:

玉翅+zhuca+weizang Font without Augmentation:

page_50_1800x630_count_9_font10_玉翅+zhuca+weizang

page_3_1123x265_count_2_font10_玉翅+zhuca+weizang

玉翅+zhuca+weizang Font with Augmentation:

page_50_1800x630_count_9_font10_玉翅+zhuca+weizang_ScribbleAugmentation_BrightnessAugmentation(factor=1.07)

page_3_1123x265_count_2_font10_玉翅+zhuca+weizang_TornAugmentation(num_tears=5, tear_size=48, jagged_step=8, jagged_variability=5)

In these images with different backgrounds, I noticed a white boundary/noise around the rendered text. What could be causing this?

jim-gyas commented 2 months ago

Pecha Format With Background:

玉翅+zhuca+weizang Font without Augmentation:

page_50_1800x630_count_9_font10_玉翅+zhuca+weizang page_3_1123x265_count_2_font10_玉翅+zhuca+weizang

玉翅+zhuca+weizang Font with Augmentation:

page_50_1800x630_count_9_font10_玉翅+zhuca+weizang_ScribbleAugmentation_BrightnessAugmentation(factor=1.07) page_3_1123x265_count_2_font10_玉翅+zhuca+weizang_TornAugmentation(num_tears=5, tear_size=48, jagged_step=8, jagged_variability=5)

In these images with different backgrounds, I noticed a white boundary/noise around the rendered text. What could be causing this?

Actually, @devatwiai, the white boundary or noise around the rendered text is part of the background image itself. This background was included in the repository you provided earlier. Here’s the background image I used for reference: Image. I will try to remove the noise from the background and update you on the progress.

devatwiai commented 2 months ago

Pecha Format With Background:

玉翅+zhuca+weizang Font without Augmentation:

page_50_1800x630_count_9_font10_玉翅+zhuca+weizang page_3_1123x265_count_2_font10_玉翅+zhuca+weizang

玉翅+zhuca+weizang Font with Augmentation:

page_50_1800x630_count_9_font10_玉翅+zhuca+weizang_ScribbleAugmentation_BrightnessAugmentation(factor=1.07) page_3_1123x265_count_2_font10_玉翅+zhuca+weizang_TornAugmentation(num_tears=5, tear_size=48, jagged_step=8, jagged_variability=5)

In these images with different backgrounds, I noticed a white boundary/noise around the rendered text. What could be causing this?

Actually, @devatwiai, the white boundary or noise around the rendered text is part of the background image itself. This background was included in the repository you provided earlier. Here’s the background image I used for reference: Image. I will try to remove the noise from the background and update you on the progress.

No, it's not about the background noise, but rather an artifact within the text itself. If you look closely at the text, you'll notice it includes not only black but also a faded white color. It wasn't there in the older images; maybe it wasn't visible because of the white background?

jim-gyas commented 2 months ago

Pecha Format With Background:

玉翅+zhuca+weizang Font without Augmentation:

page_50_1800x630_count_9_font10_玉翅+zhuca+weizang page_3_1123x265_count_2_font10_玉翅+zhuca+weizang

玉翅+zhuca+weizang Font with Augmentation:

page_50_1800x630_count_9_font10_玉翅+zhuca+weizang_ScribbleAugmentation_BrightnessAugmentation(factor=1.07) page_3_1123x265_count_2_font10_玉翅+zhuca+weizang_TornAugmentation(num_tears=5, tear_size=48, jagged_step=8, jagged_variability=5)

In these images with different backgrounds, I noticed a white boundary/noise around the rendered text. What could be causing this?

Actually, @devatwiai, the white boundary or noise around the rendered text is part of the background image itself. This background was included in the repository you provided earlier. Here’s the background image I used for reference: Image. I will try to remove the noise from the background and update you on the progress.

No, it's not about the background noise, but rather an artifact within the text itself. If you look closely at the text, you'll notice it includes not only black but also a faded white color. It wasn't there in the older images; maybe it wasn't visible because of the white background?

Thanks for pointing that out, @devatwiai. It seems the issue is with the text itself, which contains some faded white areas in addition to the black. The artifacts within the text are likely due to the way the text rendering and masking are being processed in the script. I'll check the rendering process and the use of thresholds to see if we can clean up the text output better. Thanks for bringing this to my attention!
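One plausible cause, consistent with the rendering and masking explanation above, is pasting an RGB text layer that was rendered on a white canvas onto the background without an alpha mask, so anti-aliased edge pixels keep a whitish tint. A hedged sketch of alpha compositing with Pillow that avoids this (paths and names are placeholders, not the project's actual pipeline):

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_on_background(text, background_path, font_path, font_size=10):
    """Render text on a transparent layer and alpha-composite it onto the background.

    Because the text layer is RGBA, anti-aliased edge pixels blend into the
    background instead of carrying white over from an opaque white canvas.
    """
    background = Image.open(background_path).convert("RGBA")
    text_layer = Image.new("RGBA", background.size, (0, 0, 0, 0))  # fully transparent
    draw = ImageDraw.Draw(text_layer)
    font = ImageFont.truetype(font_path, font_size)
    draw.text((80, 40), text, font=font, fill=(0, 0, 0, 255))
    return Image.alpha_composite(background, text_layer).convert("RGB")

# hypothetical usage (placeholder paths):
# page = render_text_on_background("བཀྲ་ཤིས་བདེ་ལེགས།", "backgrounds/pecha_bg.png",
#                                  "fonts/pecha_font.ttf", font_size=10)
```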