VamosC / CLIP4STR

An implementation of "CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model".
Apache License 2.0

How to fine-tune in Korean (or another language) #20

Closed simsimee closed 1 month ago

simsimee commented 1 month ago

Before I ask my questions, I would like to thank you for sharing this great work.

I have three questions:

  1. I tried to use the Multilingual-CLIP you mentioned in issue 1, but the only ViT-B models it provides are ViT-B/16+ and ViT-B/32. Is it correct that only ViT-B/16 can be used as the pre-trained CLIP backbone of CLIP4STR?

  2. Is there a way to produce inference results in other languages without additional fine-tuning?

  3. Do you have any plans to write a guide for fine-tuning CLIP4STR in another language?

mzhaoshuai commented 1 month ago

Hi, thx for reaching out.

  1. I do believe ViT-B/16 is a decent choice for STR fine-tuning. The training data has a bigger influence, so collecting more high-quality data is the best way to improve model performance. ViT-B/16 is also more efficient, which may make it the better choice in real applications. If you fine-tune on Korean data, see the sketch after this list for one way to define the character set.

  2. Nope, I do not know of a practical way to do this. Fancy approaches, such as using diffusion models or other methods to directly translate an image containing Korean text into one containing English text, are unlikely to perform well.

  3. Sorry, I am currently working on other topics.
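
In case it helps, here is a minimal sketch of how a Korean training charset could be assembled for fine-tuning. The variable `charset_train` and the idea of feeding it to the training config are assumptions about a PARSeq-style config layout, not something documented in CLIP4STR; adapt the key names to the actual configs in this repo.

```python
# Minimal sketch (assumption): build a character set covering the Hangul
# syllable block (U+AC00..U+D7A3) plus digits and Latin letters, to be used
# as the training/test charset when fine-tuning on Korean data.
# The name charset_train is hypothetical and should be matched to the
# actual CLIP4STR/PARSeq config key.
hangul = "".join(chr(cp) for cp in range(0xAC00, 0xD7A4))
digits = "0123456789"
latin = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
charset_train = digits + latin + hangul

print(len(charset_train))  # 62 ASCII characters + 11172 Hangul syllables = 11234
```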

simsimee commented 1 month ago

Thank you for your response, and thank you for sharing your wonderful project.