jim-gyas opened this issue 2 months ago
Hi @kaldan007, could you please review my card? @ta4tsering has assigned me the task of generating synthetic training data using our existing package.
Top and bottom padding = 40, left and right padding = 80, font size = 30, number of pages = 621
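For reference, these layout parameters could be grouped into a single configuration object before being passed to the generation script; the key names below are illustrative, not the package's actual API:

```python
# Illustrative layout configuration mirroring the values listed above;
# the key names are hypothetical, not the SynthImage API.
page_layout = {
    "padding_top": 40,
    "padding_bottom": 40,
    "padding_left": 80,
    "padding_right": 80,
    "font_size": 30,
    "num_pages": 621,
}
```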
Write down how much data you are going to create, e.g. how many line images for each font or each augmentation. List the numbers for each line image group you will create, and include an example line image for each group so we can visualize the outputs before the code is run. Also, why are all the page examples above in Pecha format? Why isn't there any modern page format in there?
Sure @ta4tsering, I'll take care of the tasks mentioned, including creating line images with different fonts, augmentations, and writing styles. I'll also make sure to include modern page formats along with Pecha. This will cover print types like woodprint, handwritten, and modern print, as well as various writing styles.
One example of a paper is here: https://paperswithcode.com/paper/utrnet-high-resolution-urdu-text-recognition
where they use both real and synthetic data, and you will find many more OCR papers on https://paperswithcode.com
Ok sure @ta4tsering, I will look into that.
Hi @eric and @devatwiai, could you please review my card regarding the generation of synthetic training data? I would greatly appreciate any suggestions you might have on improvements or corrections. Let me know if there are any issues or if everything looks good as is.
Hi @jim-gyas,
I have a couple of questions and suggestions:
- Have you tried using real-world Pecha backgrounds like the ones available here: Backgrounds?
- What is your process for extracting individual lines from the generated images? Do you have an option for generating segmentation masks for lines or glyph masks?
- Are you only generating images with black font? You might want to experiment with other color maps, such as red, which is sometimes present in Pechas. Faded red, in particular, can be challenging for OCR models.
Overall, the results look good to me. However, I've noticed that in some modern Tibetan images, ink from the following page is often visible on the current page. You might want to explore augmentations related to this, or other useful augmentations from the libraries listed under Image Effects, Transforms, and Text Effects.
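A rough sketch of what a bleed-through style augmentation could look like, assuming Pillow and NumPy are available (the function name and blending strength below are illustrative, not from any existing package):

```python
import numpy as np
from PIL import Image, ImageOps

def blend_bleed_through(page: Image.Image, next_page: Image.Image,
                        strength: float = 0.15) -> Image.Image:
    """Simulate ink from the following page showing through the paper.

    The following page is mirrored horizontally (as it would appear
    through the sheet), faded by `strength`, and blended into the
    current page.
    """
    current = np.asarray(page.convert("L"), dtype=np.float32)
    # Mirror and resize the following page to match the current page.
    ghost = ImageOps.mirror(next_page.convert("L")).resize(page.size)
    ghost = np.asarray(ghost, dtype=np.float32)
    # Darken the current page only where the ghost text has ink.
    blended = current - strength * (255.0 - ghost)
    return Image.fromarray(np.clip(blended, 0, 255).astype(np.uint8))
```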
Hi @devatwiai ,
Thank you for your thoughtful review and suggestions. I appreciate the time you've taken to provide such valuable feedback.
Pecha Backgrounds: I haven't yet used the Pecha backgrounds from the repository you shared, but I agree that incorporating real-world Pecha backgrounds is a great idea. I'll integrate them in the next phase of development to enhance the authenticity of the generated data.
Line Extraction: Regarding line extraction, I've implemented a detailed process to accurately extract individual lines from the generated images. The method rotates the augmented image to align the text lines horizontally, creates a blank reference image to determine a bounding box for each line, and then crops the lines based on these bounding boxes. Only non-blank lines are extracted, and if the image was rotated, the extracted lines are rotated back to their original orientation. This gives a precise, clean extraction of text lines, ready for further analysis or processing. You can find an example of the line extraction process here: Extracted Lines.
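As a rough illustration of the general idea (not the package's actual implementation), a simplified version could deskew the page and find line bands from a horizontal projection profile; the threshold values below are placeholders:

```python
import numpy as np
from PIL import Image

def extract_line_bands(page: Image.Image, skew_deg: float = 0.0,
                       min_ink_rows: int = 5) -> list[Image.Image]:
    """Simplified line extraction: deskew, find dark row bands, crop."""
    # Undo the augmentation rotation so text lines are horizontal.
    deskewed = page.rotate(-skew_deg, expand=True, fillcolor="white")
    gray = np.asarray(deskewed.convert("L"))
    # Rows containing ink have a noticeably lower mean intensity.
    ink_rows = gray.mean(axis=1) < 250
    lines, start = [], None
    for y, has_ink in enumerate(ink_rows):
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            if y - start >= min_ink_rows:  # skip specks and blank bands
                lines.append(deskewed.crop((0, start, deskewed.width, y)))
            start = None
    if start is not None and len(ink_rows) - start >= min_ink_rows:
        lines.append(deskewed.crop((0, start, deskewed.width, deskewed.height)))
    return lines
```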
Font Color: Currently, I'm generating images using black font exclusively. However, you've raised an excellent point about incorporating other color maps, particularly red and faded red, which are common in Pechas and can be challenging for OCR models. I’ll definitely explore this further to improve the diversity and robustness of the training data.
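For illustration, the ink colour could be sampled from a small palette that includes faded reds before rendering; the RGB values and helper below are placeholders (assuming Pillow is used for rendering), not the package's current behaviour:

```python
import random
from PIL import Image, ImageDraw, ImageFont

# Placeholder palette: solid black plus a few faded-red shades of the kind
# sometimes seen in Pechas.
INK_COLOURS = [(0, 0, 0), (150, 30, 30), (190, 90, 80), (205, 120, 110)]

def render_line(text: str, font_path: str, font_size: int = 30) -> Image.Image:
    """Render one line of text in a randomly chosen ink colour."""
    font = ImageFont.truetype(font_path, font_size)
    colour = random.choice(INK_COLOURS)
    # Size the canvas from the rendered text's bounding box plus padding.
    left, top, right, bottom = font.getbbox(text)
    img = Image.new("RGB", (right - left + 40, bottom - top + 40), "white")
    ImageDraw.Draw(img).text((20 - left, 20 - top), text, font=font, fill=colour)
    return img
```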
Additional Augmentations: Your suggestion about augmentations, especially regarding visible ink from following pages and the augmentations listed under Image Effects, Transforms, and Text Effects, is greatly appreciated. I’m keen on exploring these options to further refine and expand the augmentation techniques we’re currently using.
Overall, I’m grateful for your insightful feedback, and I'm excited to incorporate these suggestions to improve the quality of our synthetic data.
@ta4tsering, I've attached several variations of line spacing as you suggested in previous comments. In the earlier versions, the gap between lines seemed quite large and unrealistic. From my perspective, a line spacing of 8 looks more realistic. Could you please review the attached samples and let me know which line spacing you think looks best?
Hello @devatwiai, I wanted to clarify that I previously referenced the wrong repository. Examples of extracted lines for both the modern book synthetic dataset and the Pecha synthetic dataset are provided in the two comments above. The line extraction process for these datasets differs from the previous method.
The line extraction process involves the following steps (a condensed code sketch follows the list):
- Convert to Grayscale: The page image is converted to grayscale.
- Apply Threshold: A threshold is applied to make the text white and the background black.
- Find Contours: The contours of the text lines are detected and sorted from top to bottom.
- Create Bounding Boxes: For each line, a bounding box is created, and an image region is extracted with additional space above and below the line to achieve a more realistic appearance.
- Resize and Collect: The extracted line image is resized to match the page width and added to a list.
This process is repeated for all lines, resulting in a collection of line images extracted from the page.
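A condensed OpenCV sketch of these steps; the threshold mode, dilation kernel, and padding below are illustrative rather than the package's exact settings:

```python
import cv2

def extract_lines(page_path: str, pad: int = 10) -> list:
    """Sketch of the contour-based line extraction steps listed above."""
    page = cv2.imread(page_path)
    gray = cv2.cvtColor(page, cv2.COLOR_BGR2GRAY)
    # Threshold so text is white and the background black (inverted Otsu).
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Dilate horizontally so glyphs on one line merge into a single contour.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 5))
    merged = cv2.dilate(binary, kernel, iterations=1)
    contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Bounding boxes sorted from top to bottom.
    boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[1])
    page_h, page_w = gray.shape
    lines = []
    for x, y, w, h in boxes:
        # Keep extra space above and below the line for a realistic look.
        top, bottom = max(0, y - pad), min(page_h, y + h + pad)
        crop = page[top:bottom, x:x + w]
        # Resize the line image to match the page width.
        lines.append(cv2.resize(crop, (page_w, bottom - top)))
    return lines
```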
@jim-gyas Line extraction looks fine to me on these images for the purpose of OCR training. Since we already specify the line spacing when generating the whole page, it would be good to use that information here as well to decide the space above and below each line before extracting.
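For example (an illustrative calculation only, not existing code), the known interline spacing could directly set how much extra space is kept above and below each detected line:

```python
def crop_band(y: int, h: int, line_spacing: int, page_height: int) -> tuple:
    """Pick the vertical crop range for one line from the known spacing.

    Half of the interline spacing is kept above and below the detected
    text so neighbouring lines are not included in the crop.
    """
    pad = line_spacing // 2
    top = max(0, y - pad)
    bottom = min(page_height, y + h + pad)
    return top, bottom

# Example: a line detected at y=120 with height 35, on a page generated
# with line spacing 8 and page height 700.
print(crop_band(120, 35, 8, 700))  # -> (116, 159)
```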
@devatwiai Okay, I will look into that approach to extract lines. Thanks for the suggestion!
Pecha Format With Background:
玉翅+zhuca+weizang Font without Augmentation:
玉翅+zhuca+weizang Font with Augmentation:
In these images with different backgrounds, I noticed a white boundary/noise around the rendered text. What could be causing this?
Actually, @devatwiai, the white boundary or noise around the rendered text is part of the background image itself. This background was included in the repository you provided earlier; here is the background image I used for reference. I will try to remove the noise from the background and update you on the progress.
No, not the background noise, but rather an artifact within the text itself. If you look closely at the text, you'll notice it includes not only black but also a faded white colour. It wasn't there in the older images, or maybe it just wasn't visible because of the white background?
Thanks for pointing that out, @devatwiai. It seems the issue is with the text itself, which contains some faded white areas in addition to the black. The artifacts within the text are likely due to the way the text rendering and masking are being processed in the script. I'll check the rendering process and the use of thresholds to see if we can clean up the text output better. Thanks for bringing this to my attention!
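One possible fix, sketched here under the assumption that the text is first rendered on a plain white canvas with Pillow, is to paste the text onto the background through a mask built from the rendered glyphs, so that none of the near-white canvas pixels are copied over (the threshold value is a placeholder):

```python
import numpy as np
from PIL import Image

def composite_text(text_on_white: Image.Image, background: Image.Image,
                   threshold: int = 200) -> Image.Image:
    """Paste rendered text onto a background without a white halo.

    Pixels of the rendered canvas darker than `threshold` are treated
    as ink; everything else stays fully transparent so the background
    texture shows through untouched.
    """
    gray = np.asarray(text_on_white.convert("L"))
    # Alpha mask: 255 where there is ink, 0 elsewhere.
    mask = Image.fromarray(((gray < threshold) * 255).astype(np.uint8))
    out = background.resize(text_on_white.size).convert("RGB")
    out.paste(text_on_white.convert("RGB"), (0, 0), mask)
    return out
```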
Description:
This project updates the SynthImage package to support generating synthetic OCR training data for Tibetan script in both Tibetan Pecha and modern Tibetan book formats. Enhancements include updates to the line extraction script and support for varied fonts, dimensions, and random font sizes. For Tibetan Pecha pages, 40% have a background and 60% do not. Pages undergo random augmentations, and lines are extracted from both augmented and non-augmented pages. The primary goal is to create a diverse synthetic dataset for training OCR models to recognize Tibetan script more accurately.
Completion Criteria:
1) Pecha Format:
Pecha Format Dimensions:
Pecha Format Font Sizes:
Pecha Format Font Colors:
List of Augmentations to be applied:
Total number of synthetic training samples generated for the Pecha format (Durtsa short writing style): 43,470
2) Modern Book Format:
Modern Book Format Dimensions:
Modern Book Format Font Sizes:
Total number of synthetic training samples generated for the modern book format (Durtsa short writing style): 43,470
Resources:
1) Generating Synthetic Training Data Package: https://github.com/OpenPecha/SynthImage
2) Text file: https://github.com/OpenPecha/synthetic-data-for-ocr/tree/main/texts/kangyur
3) Fonts Repo:
4) Background repo: https://github.com/OpenPecha/synthetic-data-for-ocr/tree/main/backgrounds
Implementation:
Subtask