Closed Serge9744 closed 1 year ago
1/ Is it preferable, instead of starting from TrOCR-large-stage1, to start from a ViT model (vit-base from Google) as the encoder and CamemBERT as the decoder?
Yes, this might be beneficial. The TrOCR-large-stage1 model only knows English tokens, hence it makes sense to instantiate a new VisionEncoderDecoderModel as shown in the docs.
Hi! Many thanks for the reply! Should we correct the labels or leave them as they are, with errors? Would you have an idea on that topic? Thanks
Sorry, I still had to answer your second question :) I would make the model learn the correct spelling, so correct the labels.
Thanks, I understood your argument for the TrOCR stage 1 model, but what is your idea on the label correction? Would the correct conditional probabilities be learnt? Like an embedded auto-correct?
Also, should I train a BPE tokenizer on the training data, as I did, or use the decoder tokenizer of CamemBERT?
Thanks for everything.
Would the correct conditional probabilities be learnt? Like an embedded auto-correct?
Yes the model should learn to output the correct text.
You can use the tokenizer of CamemBERT. You can create a model like so (for example):
from transformers import VisionEncoderDecoderModel
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
"microsoft/swin-base-patch4-window7-224-in22k", "camembert-base"
)
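Continuing the snippet above, the model also needs to know CamemBERT's special-token ids before training or generation. A minimal sketch (the checkpoint names are the ones mentioned above; using the CLS/SEP tokens as start/end of sequence is a common convention here, not something prescribed by this thread):

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224-in22k", "camembert-base"
)
tokenizer = AutoTokenizer.from_pretrained("camembert-base")

# wire CamemBERT's special tokens into the seq2seq config so that
# generation knows how to start, pad and stop sequences
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size
```

The labels fed to the model during fine-tuning would then simply be `tokenizer(text).input_ids` for each transcription.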
Many thanks, everything is clear then.
I'll use the tokenizer from CamemBERT instead of a custom BPE learnt on the text.
Can you please first fix the formatting of your message? It's pretty hard to read
Hi @NielsRogge ,
I am currently trying to fine-tune TrOCR on handwritten French medical data that we labelled from images of handwritten texts.
So what I did was: 1/ load the first-stage TrOCR model, 2/ create a BPE tokenizer on the French medical training data and assign it as the decoder's tokenizer.
This is the custom tokenizer:
loaded_tok = Tokenizer.from_file(text_processor_path)
Then I wrap it:
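The wrapping code itself isn't shown above; a self-contained sketch of that step using PreTrainedTokenizerFast is below. A toy corpus stands in for the real French medical data, and the special-token strings (`<s>`, `</s>`, `<pad>`, `<unk>`) are assumptions about how the BPE tokenizer was trained:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# toy BPE standing in for Tokenizer.from_file(text_processor_path)
tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
tok.train_from_iterator(
    ["Diabète de type 2", "Hypertension artérielle", "Ordonnance du médecin"],
    trainers.BpeTrainer(vocab_size=200,
                        special_tokens=["<s>", "</s>", "<pad>", "<unk>"]),
)

# wrap the raw tokenizers.Tokenizer so it exposes the usual
# PreTrainedTokenizer API expected by the processor and Trainer
fast_tok = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    bos_token="<s>", eos_token="</s>", pad_token="<pad>", unk_token="<unk>",
)
ids = fast_tok("Diabète").input_ids
```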
Here I load ../models/TrOCR_large_stage1.
Here I assign the decoder tokens and vocab size to be the ones from the custom decoder tokenizer.
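That assignment step can be sketched as follows. To keep the example runnable without downloading the real checkpoint, a tiny randomly initialised ViT + TrOCR encoder-decoder stands in for TrOCR_large_stage1, and the vocab size and special-token ids are illustrative values that would normally come from the custom BPE tokenizer:

```python
from transformers import (TrOCRConfig, ViTConfig,
                          VisionEncoderDecoderConfig, VisionEncoderDecoderModel)

# illustrative values; in the real setup they come from the custom tokenizer
vocab_size, bos_id, eos_id, pad_id = 120, 0, 1, 2

# tiny stand-in for the pretrained TrOCR_large_stage1 checkpoint
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    ViTConfig(hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
              intermediate_size=64, image_size=32, patch_size=8),
    TrOCRConfig(d_model=32, decoder_layers=2, decoder_attention_heads=2,
                decoder_ffn_dim=64, vocab_size=50),
)
model = VisionEncoderDecoderModel(config=config)

# swap the decoder's vocabulary for the custom one: resize the embedding
# (and output projection) matrices, then update the config to match
model.decoder.resize_token_embeddings(vocab_size)
model.config.decoder.vocab_size = vocab_size
model.config.decoder_start_token_id = bos_id
model.config.eos_token_id = eos_id
model.config.pad_token_id = pad_id
```

Note that resizing the embeddings discards the pretrained decoder's knowledge of its original (English) vocabulary, which is one argument for the ViT + CamemBERT alternative discussed above.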
My questions would be :
1/ Is it preferable, instead of starting from TrOCR-large-stage1, to start from a ViT model (vit-base from Google) as the encoder and CamemBERT as the decoder? Or would my approach of using TrOCR-large-stage1 + a custom BPE tokenizer be fine on French handwritten data?
2/ There are some typos in the texts. For example, an image contains the text "Dibète" while the true word is "Diabète", so the label is incorrect and missing an "a". Should we correct the label to "Diabète" before training, even though the character is missing in the image? Since there is a language model and beam search at the decoder stage, could it learn a kind of auto-correction and predict "Diabète" even if it sees "Dibète"? Or should we keep the labels as they are for better learning and then correct them at the prediction stage?