Hello, thank you for your published paper and the open model. I am preparing to use your method to train on LaTeX-type data, such as im2latex. I would like to ask for your opinion on this task.
I currently have two concerns:
The text in LaTeX images carries less semantic meaning than the text in STR (Scene Text Recognition) tasks. I'm unsure whether CLIP4STR is applicable here and whether it has an advantage over TrOCR.
The character set for LaTeX recognition far exceeds the 94-character English set. For example, the TrOCR-based formula recognition model in this link uses a vocabulary of more than 1,200 tokens.
I would greatly appreciate any advice you can offer.
The CLIP text encoder is pre-trained on meaningful text, so I think that would be an advantage. That said, this is just a conjecture: LaTeX images have much longer context than the image captions CLIP was trained on, so some other VLM backbone pre-trained on documents might work better. I think this needs a test =_=.
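For the vocabulary concern, a rough sketch of what a LaTeX token vocabulary could look like is below. This is not part of CLIP4STR; the class name, the regex, and the special tokens are just illustrative, and you would still need to wire it into the decoder's charset/tokenizer yourself:

```python
# Illustrative sketch only: a LaTeX token vocabulary to replace a 94-character
# charset. Names (LatexTokenizer, _TOKEN_RE, special tokens) are hypothetical
# and not taken from the CLIP4STR codebase.
import re
from typing import List

# Split a LaTeX string into commands (\frac, \alpha, ...) and single characters.
_TOKEN_RE = re.compile(r"\\[a-zA-Z]+|\\.|[^\s]")

class LatexTokenizer:
    def __init__(self, formulas: List[str]):
        tokens = sorted({t for f in formulas for t in _TOKEN_RE.findall(f)})
        # Reserve the first ids for special tokens, as most STR decoders expect.
        self.itos = ["[PAD]", "[BOS]", "[EOS]"] + tokens
        self.stoi = {t: i for i, t in enumerate(self.itos)}

    def encode(self, formula: str) -> List[int]:
        ids = [self.stoi[t] for t in _TOKEN_RE.findall(formula)]
        return [self.stoi["[BOS]"]] + ids + [self.stoi["[EOS]"]]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.itos[i] for i in ids if i > 2)

if __name__ == "__main__":
    corpus = [r"\frac{a}{b} + \alpha^{2}", r"\sum_{i=1}^{n} x_i"]
    tok = LatexTokenizer(corpus)
    print(len(tok.itos))  # vocabulary size grows with the training corpus
    ids = tok.encode(r"\frac{a}{b}")
    print(ids, "->", tok.decode(ids))
```

In practice you would build the vocabulary from the full im2latex training set so it covers all commands (which is roughly where the 1,200+ figure comes from), rather than from a couple of example formulas.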