Open karndeepsingh opened 2 years ago
Hi,
I'm no expert on this and was just browsing the code while studying the paper but I think to be able to use the model for a different language you need to just change the word embedding. Also, the authors mention that in the paper they initialize their transformer with the weights of a pretrained ViT (no language is involved, it's just for visual features).
So just training the model from scratch with the same initial weights as ViT and changing the word embedding should work.
I know it has been some time since you asked your question but I thought it wouldn't hurt to try and answer. Hopefully it was helpful!
Hi, I have Image and Description of Products which is in Spanish language and want to train a classifier model using ViLT. What kind of pretrained model shall I use to train it on my Spanish Text and Image? I assume the model shared was trained in the English language it won't help me to train on my Spanish text. Correct me If I am wrong!
Please also suggest me the best way to train the model. Shall I train the model from scratch if not then which model weight shall I use to train on my dataset.
Thanks