VamosC / CLIP4STR

An implementation of "CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model".
Apache License 2.0

Using the LaTeX dataset to train CLIP4STR #14

Closed: Sanster closed this issue 1 month ago

Sanster commented 2 months ago

Hello, thank you for your published paper and the open model. I am preparing to use your method to train on LaTeX-type data, such as im2latex. I would like to ask for your opinion on this task.

I currently have two concerns:

  1. The textual data in LaTeX images carries less semantic meaning than the text in STR (Scene Text Recognition) tasks. I'm unsure whether the CLIP4STR method is applicable here and whether it has an advantage over TrOCR.
  2. The character set for LaTeX recognition far exceeds the 94 characters of the English set. For example, the TrOCR-based formula recognition model seen in this link uses a vocabulary of 1200+ tokens.

    I would greatly appreciate any advice you can offer.

mzhaoshuai commented 2 months ago

Hi, thanks for reaching out.

Just some personal thoughts:

  1. The CLIP text encoder is pre-trained on meaningful natural-language text, which I think would be an advantage. Still, this is just a conjecture: LaTeX images have a longer context than the image captions CLIP was trained on, so maybe some document-pre-trained VLM backbones would work better. I think this needs a test =_=.
  2. I think we would need to rework the tokenization part of the code. As you can see, the character set is defined in https://github.com/VamosC/CLIP4STR/blob/main/configs/charset/94_full.yaml, and the tokenizer lives at https://github.com/VamosC/CLIP4STR/blob/d18f2f4b98b7e3dc1a59a845a6940997a4e9c09c/strhub/data/utils.py#L46 and https://github.com/VamosC/CLIP4STR/blob/d18f2f4b98b7e3dc1a59a845a6940997a4e9c09c/strhub/data/utils.py#L107. CLIP4STR basically performs per-position classification over this character set, so the set of classes would need to be extended; a rough sketch of what that could look like is below.
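
To make the idea concrete, here is a minimal sketch, not the repository's actual `Tokenizer` API: the class name, special tokens, and example vocabulary are assumptions for illustration only. It shows how a character-level class lookup could be generalized to multi-character LaTeX tokens such as `\frac` or `\alpha` using a greedy longest-match split.

```python
# A minimal sketch, NOT the repo's actual Tokenizer: it shows how a character-level
# class lookup could be generalized to multi-character LaTeX tokens via greedy
# longest match. Class name, special tokens, and example vocab are illustrative only.

class LatexTokenizer:
    EOS, BOS, PAD = '[E]', '[B]', '[P]'  # placeholder special tokens

    def __init__(self, vocab: list[str]):
        # vocab: LaTeX tokens, possibly multi-character, e.g. ['\\frac', '\\alpha', '{', '}', 'x']
        self._itos = [self.EOS, *vocab, self.BOS, self.PAD]
        self._stoi = {tok: i for i, tok in enumerate(self._itos)}
        self._max_len = max(len(tok) for tok in vocab)

    def encode(self, text: str) -> list[int]:
        """Split `text` into vocab tokens by greedy longest match and map them to class ids."""
        ids, i = [], 0
        while i < len(text):
            for j in range(min(len(text), i + self._max_len), i, -1):
                if text[i:j] in self._stoi:
                    ids.append(self._stoi[text[i:j]])
                    i = j
                    break
            else:
                i += 1  # silently skip anything outside the vocab (e.g. stray whitespace)
        return ids

    def decode(self, ids: list[int]) -> str:
        specials = {self.EOS, self.BOS, self.PAD}
        return ''.join(self._itos[i] for i in ids if self._itos[i] not in specials)


# Example: r'\frac{x}{y}' is split into ['\\frac', '{', 'x', '}', '{', 'y', '}'] and back.
tok = LatexTokenizer(['\\frac', '{', '}', 'x', 'y'])
assert tok.decode(tok.encode(r'\frac{x}{y}')) == r'\frac{x}{y}'
```

With an approach like this, the charset config would list LaTeX tokens instead of single characters, and the classification head would grow from 94 classes plus specials to the size of the new vocabulary.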