OpenPecha / Font-Style-Transfer

Create synthetic data for OCR by transferring a font style onto target text
MIT License

OCR0027: Font style transfer #1

Open Norbu-Jamling opened 5 months ago

Norbu-Jamling commented 5 months ago

Description: To create synthetic data for OCR, we are trying out the approach of font style transfer using deep learning. The model will transfer a font style onto a given image of text. The next step is to research the various technologies available to achieve this.

Initial inspiration: we build a model that transfers the font of a short text image (target font) onto the image of text we provide (source font). Paper referred to: "Multi-Content GAN for Few-Shot Font Style Transfer" (https://openaccess.thecvf.com/content_cvpr_2018/papers/Azadi_Multi-Content_GAN_for_CVPR_2018_paper.pdf)


Limitations: it needs a large paired font-transfer dataset (around 10k fonts in the paper), so we can't follow this method completely.

2 Possible Solutions:

1) Glyph-based font style transfer, with transfer learning on a Chinese handwritten glyph dataset


Pros: adds handwritten Tibetan font data to our OCR dataset, which currently lacks it. Cons: the Photoshopping step is done by a human and may take some time, i.e. it is a bottleneck; to overcome this we might want the model to output images that are already scaled to fit the font template.

2) Line-based font style transfer on the Lhasa Kanjur dataset


Model: generative neural networks such as GANs, VAEs, Pix2Pix, and diffusion models are available. We need to research and test them to see which one is best suited for our task.

Project completion criteria: a working model for either of the two proposed ideas.

Tasks:

Norbu-Jamling commented 5 months ago

Have explored simple GANs and written a report on one research paper: "Multi-Content GAN for Few-Shot Font Style Transfer".

Norbu-Jamling commented 5 months ago

Looking into diffusion models as a possible solution.

Norbu-Jamling commented 5 months ago

Also need to look into cGANs and CycleGANs as possible solutions.

Norbu-Jamling commented 5 months ago

Even though we want few-shot font style transfer for any given picture of a font, such methods require a diverse dataset of font mappings, around 10k fonts according to the research paper, which we lack. So we will go step by step, testing every type of possible solution that achieves the goal or parts of it.

Norbu-Jamling commented 5 months ago

I have updated the subtasks

Norbu-Jamling commented 5 months ago

Made a conditional GAN prototype for line-based font style transfer on the Lhasa Kanjur dataset and trained it for a day. Seeing no results: the output image is meaningless noise, and the loss functions are not decreasing for either the generator or the discriminator. Need to evaluate how to improve the dataset or the model architecture.
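
For reference, a minimal sketch of the conditional-GAN objective involved here (the prototype's actual code isn't shown in this issue; this assumes a pix2pix-style setup where the discriminator is conditioned on the source image, and `generator`/`discriminator` are placeholder models):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
LAMBDA_L1 = 100.0  # L1 weight used in the pix2pix paper

def generator_loss(discriminator, source, fake, target):
    # The generator is rewarded when the discriminator labels its output
    # real; the L1 term pulls the output towards the ground-truth image.
    pred_fake = discriminator(source, fake)
    adv = bce(pred_fake, torch.ones_like(pred_fake))
    return adv + LAMBDA_L1 * l1(fake, target)

def discriminator_loss(discriminator, source, fake, target):
    # The discriminator sees (source, target) pairs as real and
    # (source, generated) pairs as fake; detach() blocks generator gradients.
    pred_real = discriminator(source, target)
    pred_fake = discriminator(source, fake.detach())
    return 0.5 * (bce(pred_real, torch.ones_like(pred_real))
                  + bce(pred_fake, torch.zeros_like(pred_fake)))
```

If both losses plateau immediately, common first checks are the learning-rate balance between the two networks and whether the inputs are normalized to the range the models expect.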

Norbu-Jamling commented 5 months ago

Updated my card to give a more concise description of the project, explaining the two possible methods and the project completion criteria.

Norbu-Jamling commented 5 months ago

GitHub page updated with 2 new branches for the 2 methods

Norbu-Jamling commented 5 months ago

Currently following the pix2pix paper (https://arxiv.org/pdf/1611.07004) for handwritten Chinese glyph font transfer.

Norbu-Jamling commented 4 months ago

Successfully recreated the pix2pix paper on their maps dataset (1k images, 50 epochs, ~10 minutes; result image attached).

Now modifying it for the Chinese handwritten dataset.
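
As an aside, the maps dataset stores each training pair side by side in a single file; a loader along these lines splits it (a sketch under that assumption, with a hypothetical class name and file layout, not the repo's actual loader):

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class PairedImageDataset(Dataset):
    """Loads pix2pix-style pairs stored side by side in one image file."""

    def __init__(self, root):
        self.paths = sorted(Path(root).glob("*.jpg"))
        self.to_tensor = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize([0.5] * 3, [0.5] * 3),  # scale to [-1, 1]
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        w, h = img.size
        source = img.crop((0, 0, w // 2, h))   # left half: input image
        target = img.crop((w // 2, 0, w, h))   # right half: ground truth
        return self.to_tensor(source), self.to_tensor(target)
```

The same layout can be reused for glyph pairs by rendering source and target glyphs into the two halves.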

Norbu-Jamling commented 4 months ago

Trained the model on Chinese glyphs (1k images, 300 epochs, ~1 hour). Getting some vague results for Chinese glyphs, but the model has overfitted and doesn't produce meaningful output for a Tibetan glyph. Will need to train on a bigger and more diverse dataset with more languages and more epochs (result images attached).

Norbu-Jamling commented 4 months ago

The pix2pix model did well on the maps dataset, where there is consistency within the 1k source and target images. Since there is little consistency within the handwritten images (because of variable stroke length/size/density), it is harder for the model to learn the conversion.

Norbu-Jamling commented 4 months ago

Trained a model following the pix2pix architecture (1k Lhasa Kanjur line images, 75 epochs, ~10 minutes). It produces images similar to the Lhasa Kanjur font, but they are meaningless, and it produces the same image even for different inputs (which looks like mode collapse). Will train a bigger model on the entire dataset (~1.3 lakh training images) using the same architecture to see whether these problems improve (result images attached).

Norbu-Jamling commented 4 months ago

After training on the entire Lhasa Kanjur dataset (~1.5 lakh images) for ~15 lakh steps / 10 epochs over 2 days, these are the results (result images attached).

Observations:

1) The images produced are meaningless and don't use the information provided by the input image; the model disregards the characters written in the input but outputs random letter shapes that highly resemble the Lhasa Kanjur font.

2) Such images appear from the first few thousand steps too; training longer doesn't make them meaningful. There is no progress.

Points:

1) Increasing the dataset, i.e. using 1.5 lakh instead of 1k images, won't improve the model much, since the images are all similar and adding more of them doesn't actually provide diversity. Data augmentation techniques like flipping images may improve the model instead (need to try this); even the paper uses a dataset of around 1k.

2) The latent space dimension is 1x16x512, i.e. height 1, width 16, 512 features. A 256x4096 image contains around 100 characters (and Tibetan has more than 3k unique glyph combinations), so the latent space is too small to capture the features of the input image; that is why the model produces meaningless text that is highly similar to the Lhasa Kanjur font.

I have stopped training this model. Need to increase the latent space dimension and, if possible, try data augmentation.
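
For intuition on where the 1x16x512 latent comes from: eight stride-2 downsampling layers halve each spatial dimension eight times, so 256x4096 collapses to 1x16. A rough shape check (the channel progression is an assumption, not the project's exact encoder):

```python
import torch
import torch.nn as nn

# Eight stride-2 convolutions: 256 / 2**8 = 1, 4096 / 2**8 = 16.
channels = [3, 64, 128, 256, 512, 512, 512, 512, 512]
layers = []
for c_in, c_out in zip(channels[:-1], channels[1:]):
    layers += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
               nn.LeakyReLU(0.2)]
encoder = nn.Sequential(*layers)

x = torch.randn(1, 3, 256, 4096)  # one Lhasa Kanjur line image
print(encoder(x).shape)           # torch.Size([1, 512, 1, 16])
```

Dropping downsampling layers is the straightforward way to enlarge the latent: four stride-2 layers instead of eight would give the 16x256 spatial grid tried next.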

Norbu-Jamling commented 4 months ago

After training the modified model (latent space increased to 16x256x512) for 100 epochs (~8 hours) on the 1k dataset (result image attached).

Points: 1) This time the model uses the information provided in the input, but the output is blurry/noisy and not as realistic. As the paper also mentions, too large a latent space produces blurry images, while too small a latent space fails to capture the input information; a suitable latent space in between produces sharp images while still using the input information.

I have stopped training this model. Need to find a suitable latent space dimension in between the above two models.

Norbu-Jamling commented 4 months ago

Model Potential:

If the dataset is internally consistent, for example the Lhasa Kanjur text, which is printed from blocks that reproduce the same font every time, or the Google Maps dataset, which has colour, size, and shape consistency, then the model can train effectively on even a 1k dataset.

This model architecture has the potential to produce as many images as we want after training on just 1k images from a given domain. That saves resources: we scan only 1k images of any particular scripture type instead of the entire thing, and synthesise the rest later. We can then focus on collecting small image datasets of as many different scripture types as possible, rather than diving deep to cover all the images of a single scripture type.

Example scenario: for the Lhasa Kanjur we would need to scan only 1k images rather than 1.5 lakh; for Drepung, only 1k rather than 5 lakh. Other font types with limited data can be synthesised and thus effectively represented in our OCR training.

Drawback: if the domain is not consistent, for example handwritten images (variable stroke shape, size, density), then 1k images may not be enough to train the model effectively, and a bigger dataset will be required.

Norbu-Jamling commented 4 months ago

Thoughts:

1) Since the Lhasa Kanjur data is mostly similar, an OCR trained on all 1.5 lakh images probably wouldn't improve drastically over one trained on just 1k images. Maybe we can train our existing OCR architecture on just a fraction (1/100 or 1/1000) of our OCR data, i.e. a subset that represents the full set properly (see the sketch after this comment). We might get a model that performs near the benchmark of the OCR trained on the entire data. This way we can test many OCR models cheaply and quickly, and then compare different architectures/versions.

2) Even though the pix2pix model may increase our data, it may not improve our OCR much if used this way. We need to transfer fonts for which we don't even have 1k images; this needs to be explored. If handwritten Chinese font transfer to Tibetan succeeds, then any other font can be handled the same way. That would feed the OCR completely fresh data, diversify it, and improve it drastically.
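
The simplest version of the subset idea in thought 1 is a seeded random sample (hypothetical helper; stratified sampling by scripture type would represent the full set better if that metadata is available):

```python
import random

def sample_subset(items, fraction=0.01, seed=0):
    """Draw a reproducible random subset for cheap OCR training runs."""
    rng = random.Random(seed)  # fixed seed keeps runs comparable
    k = max(1, int(len(items) * fraction))
    return rng.sample(items, k)
```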

Norbu-Jamling commented 4 months ago

After training the modified model (latent space adjusted to 4x64x2048) for 100 epochs (~8 hours) on the 1k dataset (result image attached).

Point: the images are not clear, but the input information can be seen reflected in the output.

Need to experiment with hyperparameters and data augmentation methods.
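
One illustrative augmentation helper for paired line images (hypothetical code, with ranges picked arbitrarily): any geometric transform has to be applied identically to the source and the target, otherwise the pair no longer lines up.

```python
import random
import torchvision.transforms.functional as TF

def augment_pair(source, target):
    """Apply the same random jitter to both images of a training pair."""
    angle = random.uniform(-2.0, 2.0)          # slight rotation
    source = TF.rotate(source, angle, fill=255)
    target = TF.rotate(target, angle, fill=255)
    if random.random() < 0.5:                  # flipping, as floated earlier
        source = TF.hflip(source)              # (note: this mirrors glyphs)
        target = TF.hflip(target)
    factor = random.uniform(0.8, 1.2)          # ink/paper brightness variation
    source = TF.adjust_brightness(source, factor)
    return source, target
```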

ta4tsering commented 4 months ago

The model @Norbu-Jamling trained used the Lhasa Kanjur line-to-text dataset, which has both line images and transcriptions. So if we continue on this path and later want a style transferred onto a line image, we will need both line images and transcriptions of the target style, which is a drawback. He will continue the experiments until the next strategy meeting.

Norbu-Jamling commented 4 months ago

After training the modified model (latent space adjusted to 8x128x1024) for 100 epochs (~2 hours) on the 1k dataset (result image attached).

ta4tsering commented 4 months ago

Looking for pre-trained models that can be used on Tibetan scripts to transfer style. Found two models so far and still looking for more models and research. Will email Devesh from Wadwani with some questions.

ta4tsering commented 4 months ago

Preparing a report on the research papers read, to present to Devesh from Wadwani in tomorrow's meeting.

kaldan007 commented 4 months ago

Did a KT session with Jinpa regarding data augmentation. Continuing preparation for the session with Devesh.

Norbu-Jamling commented 4 months ago

Held the meeting with Devesh Pant about his work on Hindi font style transfer and font interpolation. We learned that GANs don't work well for font transfer; diffusion models gave them better results, and he was clear that we should use diffusion over GANs. On usability for OCR data: if we build a diffusion model, it will generate slightly different fonts each time, so it will improve the OCR if we manage to build it. We also learned that a model working on whole lines is quite complex for a machine to learn, so starting with glyphs is better. For glyphs, nothing beats manual production, as it is guaranteed to improve the OCR, and annotators will be needed to align the AI-generated fonts on the font template. Despite its limitations, glyph-based generation serves as a stepping stone if we want to tackle the more complex line generation.

He also told us that the papers available online are purely research-oriented font style transfer; no resources discuss the end-to-end use of it for improving OCR or creating synthetic data.

A valuable tip he gave us: don't generate images from noise; always use image-to-image translation, since generating an image from noise is much more complex and gives poor results, at least in the case of GANs.

Another point: models tend to learn to transfer the content rather than the style of the images. So we need to decouple content from style somehow, be it through the loss function or otherwise; this is something even they failed to do perfectly. We definitely need to research decoupling, as it is the main problem of font style transfer.
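
One common decoupling recipe from the literature, sketched here as a hypothetical starting point (none of this is implemented in the project yet): separate content and style encoders, a decoder that combines their codes, and a consistency loss that re-encodes the output to check that each latent captured only its own factor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FontTransferNet(nn.Module):
    """Hypothetical two-encoder generator for content/style decoupling."""

    def __init__(self, content_enc, style_enc, decoder):
        super().__init__()
        self.content_enc = content_enc  # should capture *what* is written
        self.style_enc = style_enc      # should capture *how* it is written
        self.decoder = decoder          # combines the two codes into an image

    def forward(self, content_img, style_img):
        c = self.content_enc(content_img)
        s = self.style_enc(style_img)
        return self.decoder(c, s), c, s

def latent_consistency_loss(model, output, c, s):
    # Re-encode the generated image: its content code should match the
    # content input's code, and its style code the style reference's code.
    return (F.l1_loss(model.content_enc(output), c)
            + F.l1_loss(model.style_enc(output), s))
```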

We concluded that, after researching this topic for over a month, it's best to spend some time covering all remaining options before we close the project entirely.

We also learned about a new area of interest: font interpolation, i.e. generating many intermediate fonts from a few existing fonts. Need to explore this topic more.

Devesh gave us a head start by citing valuable research papers he has used.

WordStylist model (he recommended exploring this paper first):

Norbu-Jamling commented 4 months ago

Currently learning the basics of diffusion models; will then proceed to recreate the WordStylist paper.
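
For orientation while studying, the core of a diffusion model is the noise-prediction training step below (a generic DDPM sketch; WordStylist builds a latent diffusion model with style and text conditioning on top of this idea, and `model(x_t, t)` is a placeholder signature):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, x0):
    t = torch.randint(0, T, (x0.shape[0],))       # random timestep per image
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward (noising) process
    return F.mse_loss(model(x_t, t), noise)       # the model predicts the noise
```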

TenzinGayche commented 3 months ago

@Norbu-Jamling LGTM