OCR0042: Synthetic Data Creation Research using Diffusion

Norbu-Jamling commented 1 month ago

Description: Following up after gaining valuable knowledge from 'OCR0027 Font Style Transfer Research', we learn the single valuable lesson: DIFFUSION OUTPERFORMS GANS. We now Start our research on Diffusion models and their types and see how it can aid us in generating synthetic data from various scripture fonts currently present in our OCR data.

Background: Since our Transformer Based OCR model is data hungry it requires millions of image data to be effective unlike other models like LSTMs which may generalise even on smaller data size. We have data of scriptures that have certain style to it throughout. Some scriptures have only few lakhs of images in it. while few lakhs of data might not be enough for our Transformer based OCR, it may be enough to create a generative ai model. This way we may train a generative model on our data and then create millions of synthetic data from it. this way we can drastically improve our TrOCR.

Experiments and its Outcomes:

[x] Simple Diffusion model on glyphs Works great! Can be used to create diverse fresh high quality glyphs each time.
[x] Simple UNET model for img2img translation Difficult to produce good quality realistic images that resemble LK dataset. Content of input image is kept, but style is not transferred.
[x] Conditional DDPM model with Img condition Can produce realistic images that closely resembles LK dataset, but fails to reflect the content of condition.
[x] Conditional DDPM model with vector condition Can produce realistic images that closely resembles LK dataset, but fails to reflect the content of condition.
[x] Diffusion with discriminator model Can produce realistic images that closely resembles LK dataset, but fails to reflect the content of condition.
[x] Diffusion with OCR loss model Code has been written but not run with google vision OCR. Not enough testing done on this.

Detailed Summary of each Experiment: In Github repo wiki: wiki

Future Potential Solution: 1) Glyph based Solution: Diffusion is really good at making single glyph images. It has all the qualities we desire- high quality, diverse, resembles the dataset, fresh each time. But we need our data in form of sentences. So we can use ai trained model to resize the glyph images, use ai to mark baseline, and print out sentences one character at a time. This also solves the problem of overfitting in manual font method OCR0041.

2) UNET based solution: Currently UNET is pretty good at keeping the content but struggling with style. This is because of the loss function. pixel wise comparison using MSE is not working well as the input computer font image characters don't align with the characters in LK dataset images. Best option we had is VGG perceptual loss. Which is not doing a good job on style transfer.

We can tell annotators to move the characters of condition image to align with the LK image. We may only need few thousand to be done by and then train an ai to do this aligning job.

When we have entire dataset with aligned characters then the UNET will work with MSE pixel comparison loss and may produce good results.

3) Word based solution: Since sentences is too complex for machine, we may try with words. Word level comes in-between glyph level and sentence level. It has the following feature:

Less complex conditional diffusion than sentences
More realistic looking character appending than glyph based

We will need annotators to crop out images of words from the sentences. Then try the conditional diffusion models on word dataset. Final output by model will be word images so we will need to append words ignorer to create sentence images.

4) OCR loss solution: Not enough testing has been done on OCR loss function cddpm.

OCR created by monlamis not able to do anything on LK dataset. While Google vision OCR is performing very well. Code has been written and tested using Monlam OCR. Not done using Google vision OCR API .

Need to try it out with google OCR API.

Norbu-Jamling commented 1 month ago

currently working on simple auto encoder to generate images from transcription text. tested for 1k image. next will train entire Lhasa Kanjur 1.5 lakh data.

on validation set:

on training set itself:

model has overfitted on training 1k training data. now see if increasing training data will help it generalise better.

ta4tsering commented 1 month ago

the trails wasn't fruitful so will be looking into more approaches.

Norbu-Jamling commented 1 month ago

Trained a diffusion DDPM model on 6k glyph images of derge and pacing glyphs. Images generated are similar to the ground truth. it generates fresh images every time. potential: can be used for synthetic image generation done using script. currently it uses 10 images of a glyph and randomly choses one each time. if we use this model to generate, the glyphs will be fresh and not seen by the model till now.

below are images generated by model:

this is currently randomly generating any glyph. we can generate specific character on demand if we use conditional ddpm instead easily. but then we might need more than 10 images per character for good quality and diversity in the images for that character.

Norbu-Jamling commented 1 month ago

This work of generating image given its transcription is pretty complex. This is because: 1) if we use transcription vector2image then the task of learning the entire tibetan language as it needs to map individual character to its image is too complex. 2) if we use image2image then the transcription image don't really align with Lhasa Kanjur images on the pixel level. so loss functions like L1 and L2 that compare on the pixel level do badly.

Till now we have learned Diffusion models are great at generating diverse and high quality images. We can leverage this quality of diffusions to produce good images, but to make the diffusion use transcription is a difficult task.

Currently best options are if we work on glyph level. or if we want to work on LK line dataset, a possible solution would need annotators to map/edit the computer images to align over the LK image. This is just an idea for the time being.

Norbu-Jamling commented 1 month ago

Research is still going on to make a model that has all these qualities together: 1) meaning- makes use of the content provided in the transcription 2) Image Quality- generates high quality images that are very close to the ground truth 3) Diversity- Produces diverse images each time.

Just a general rule:

Norbu-Jamling commented 1 month ago

Trained a diffusion ddpm model on LK dataset to produce images that look similar. This doesn't make use of transcription. This is to show diffusion models strong points - high quality diverse image generation.

Maybe training longer and on entire dates over more epochs may teach the model to generate images that have some meaning. even if this works then we will need annotators to transcribe it.

Images generated by model (trained on 30k images for 1.5 epoch):

Research and trials are going on to get such level of images but also using transcription for meaning

Norbu-Jamling commented 1 month ago

trained a conditional ddpm on 1.3 lakh training dataset lhasa Kanjur for 6 epochs,12 hours. this model takes transcription vector too in order to learn to generate meaningful images. but till now it has failed to do so.

we can see the machine producing random words.

one reason may be our incentive structure- currently for this model we are only using MSE loss. this tells the model to generate realistic images. but providing just the transcription vector in the model may not be enough incentive for model to learn these mapping of vector to image.

one possible incentive structure:

Using our pretrained OCR as an additional loss function. OCR will detect what's written in the generated image and produce its vector form. we can then compare the original vector and the generated image's vector and then back propagate to reduce this distance.

kaldan007 commented 1 month ago

regarding OCR model, reach out to @ta4tsering

ta4tsering commented 1 month ago

https://huggingface.co/BDRC/PhotiLines this is the line segmentation model and then https://huggingface.co/BDRC/Woodblock this is the OCR model and you can use the https://github.com/OpenPecha/monlam_ocr.git this the only repo we have that can run the models.

Norbu-Jamling commented 1 month ago

changed this card's serial number from OCR0038 TO OCR0042 as there were two cards with same serial number 38

Norbu-Jamling commented 1 month ago

reading the research paper: https://openaccess.thecvf.com/content/CVPR2023/papers/Zhu_Conditional_Text_Image_Generation_With_Diffusion_Models_CVPR_2023_paper.pdf

and looking to document/update Github repo with all important codes/file.

Norbu-Jamling commented 1 month ago

Uploaded , research papers, Project codes and playground codes into GitHub. @TenzinGayche will be doing code review

Norbu-Jamling commented 1 month ago

Updated the card description to include Plan of action, its outcomes and Future Potential Solution.

TenzinGayche commented 1 month ago

@Norbu-Jamling LGTM!!

OpenPecha / Synthetic-Data-Creation-using-Diffusion

OCR0042: Synthetic Data Creation Research using Diffusion #1