NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

Fine-tune TrOCR on French medical data #270

Closed Serge9744 closed 1 year ago

Serge9744 commented 1 year ago

Hi @NielsRogge ,

I am currently trying to fine-tune TrOCR on handwritten French medical data that we labelled from images of handwritten texts.

What I did was: 1/ load the first-stage TrOCR model, 2/ create a BPE tokenizer on the French medical training data and assign it to the decoder.

This is the custom tokenizer:

from tokenizers import Tokenizer

loaded_tok = Tokenizer.from_file(text_processor_path)

Then I wrap it in a PreTrainedTokenizerFast:

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=loaded_tok,
    bos_token= '<s>',
    eos_token= '</s>',
    pad_token='[PAD]',
    mask_token= '<mask>',
    unk_token= '<unk>',
    sep_token='</s>',
    cls_token= '<s>')

from torch.utils.data import DataLoader

# creating train and dev datasets
train_dataset = HandwrittenDataSet(df_train_total, processor, wrapped_tokenizer, max_pad_length)
dev_dataset = HandwrittenDataSet(df_dev, processor, wrapped_tokenizer, max_pad_length)
# creating train and dev loaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
dev_dataloader = DataLoader(dev_dataset, batch_size=batch_size)
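The HandwrittenDataSet class is not shown in this thread; below is a minimal sketch of what such a dataset could look like, assuming the dataframe has file_name and text columns and that the images live under a root_dir (both assumptions, not from the thread):

import torch
from PIL import Image
from torch.utils.data import Dataset

class HandwrittenDataSet(Dataset):
    def __init__(self, df, processor, tokenizer, max_pad_length, root_dir="./images"):
        self.df = df
        self.processor = processor        # image processor producing pixel_values for the encoder
        self.tokenizer = tokenizer        # wrapped BPE tokenizer used to encode the labels
        self.max_pad_length = max_pad_length
        self.root_dir = root_dir

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(f"{self.root_dir}/{row['file_name']}").convert("RGB")
        pixel_values = self.processor(image, return_tensors="pt").pixel_values.squeeze()

        labels = self.tokenizer(
            row["text"],
            padding="max_length",
            max_length=self.max_pad_length,
            truncation=True,
        ).input_ids
        # replace pad token ids by -100 so they are ignored by the cross-entropy loss
        labels = [l if l != self.tokenizer.pad_token_id else -100 for l in labels]

        return {"pixel_values": pixel_values, "labels": torch.tensor(labels)}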

Here I load the stage-1 checkpoint from ../models/TrOCR_large_stage1:

import torch
from transformers import VisionEncoderDecoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
base_model = VisionEncoderDecoderModel.from_pretrained("../models/TrOCR_large_stage1")
base_model.to(device)

Here I assign the decoder start/pad/eos token ids and the vocab size to the ones from the custom tokenizer:

base_model.config.decoder_start_token_id = wrapped_tokenizer.bos_token_id
base_model.config.pad_token_id = wrapped_tokenizer.pad_token_id
base_model.config.vocab_size = wrapped_tokenizer.vocab_size
base_model.config.eos_token_id = wrapped_tokenizer.sep_token_id
base_model.tokenizer = wrapped_tokenizer
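One caveat worth noting (my addition, not something from the thread): setting config.vocab_size alone does not resize the decoder's embedding and output layers. If the custom tokenizer's vocabulary size differs from the original decoder vocabulary, something along these lines is typically needed as well:

# resize the decoder's token embeddings to match the custom tokenizer's vocabulary
base_model.decoder.resize_token_embeddings(len(wrapped_tokenizer))
base_model.config.decoder.vocab_size = len(wrapped_tokenizer)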

My questions would be:

1/ Instead of starting from TrOCR large stage 1, is it preferable to start from a ViT model (ViT-base from Google) as the encoder and CamemBERT as the decoder? Or is my approach of using TrOCR large stage 1 plus a custom BPE tokenizer fine for French handwritten data?

2/ There are some typos in the texts. For example, an image contains the text "Dibète" while the true word is "Diabète", so the label is incorrect and missing an "a". Should we correct the label to "Diabète" for training, even though the character is missing in the image? Since there is a language model and beam search at the decoder stage, could it learn a kind of auto-correction and predict "Diabète" even when it sees "Dibète"? Or should we keep the labels as they are for better learning and correct them at the prediction stage?

NielsRogge commented 1 year ago

1/ Instead of starting from TrOCR large stage 1, is it preferable to start from a ViT model (ViT-base from Google) as the encoder and CamemBERT as the decoder?

Yes, this might be beneficial. The TrOCR-large-stage1 model only knows English tokens, hence it may make sense to instantiate a new VisionEncoderDecoderModel as shown in the docs.
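For reference, a minimal sketch of what that could look like with the checkpoints mentioned in the question (the ViT checkpoint name is an assumption for "ViT-base from Google", not something stated in the thread):

from transformers import VisionEncoderDecoderModel

# warm-start the encoder from a ViT checkpoint and the decoder from CamemBERT;
# the cross-attention weights are randomly initialised and learnt during fine-tuning
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "camembert-base"
)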

Serge9744 commented 1 year ago

Hi! Many thanks for the reply! Should we correct the labels or leave them as they are, with the errors? Would you have an idea on that topic? Thanks

NielsRogge commented 1 year ago

Sorry, I also had to answer your second question :) I would make the model learn the correct spelling, so correct the labels.

Serge9744 commented 1 year ago

Thanks, I understood your argument for the TrOCR stage 1 model, but what is your view on the label correction? Would the correct conditional probabilities be learnt, like an embedded auto-correct?

Also, should I train a BPE tokenizer on the training data like I did, or use the decoder tokenizer of CamemBERT?

Thanks for everything

NielsRogge commented 1 year ago

Would the correct conditional probabilities be learnt, like an embedded auto-correct?

Yes, the model should learn to output the correct text.

You can use the tokenizer of CamemBERT. You can create a model like so (for example):

from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224-in22k", "camembert-base"
)
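To prepare data for such a model, the encoder's image processor and the CamemBERT tokenizer can be combined into a single TrOCRProcessor. A sketch, assuming the same checkpoints as above and mirroring the config steps from the first message:

from transformers import AutoImageProcessor, AutoTokenizer, TrOCRProcessor

# pair the encoder's image processor with the French tokenizer
image_processor = AutoImageProcessor.from_pretrained("microsoft/swin-base-patch4-window7-224-in22k")
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
processor = TrOCRProcessor(image_processor, tokenizer)

# set the decoding-related config values
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size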
Serge9744 commented 1 year ago

Many thanks, everything is clear then.

I'll use the tokenizer from CamemBERT instead of a custom BPE learnt on the text, then.

NielsRogge commented 1 year ago

Can you please first fix the formatting of your message? It's pretty hard to read