NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

How to change LayoutLMv3 to a LayoutLMv3 XLM (i.e. LayoutXLM-like model) #253

Open piegu opened 1 year ago

piegu commented 1 year ago

Hi @NielsRogge,

I plan to finetune a LayoutXLM-large-like model. Why "like model"? Because, so far, Microsoft has not released a LayoutXLM large, only a base version.

As I want to train a large version on documents in a language other than English (I have a huge labeled document dataset), I need to change the LayoutLMv3 large tokenizer from RoBERTa to XLM-RoBERTa.

I understand from page 9 of the paper LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking that this change was done for the Chinese LayoutLMv3:

For the multimodal Transformer encoder along with the text embedding layer, LayoutLMv3-Chinese is initialized from the pre-trained weights of XLM-R [7]. We randomly initialized the rest model parameters.

From this quote, I understand that Microsoft copied the embedding weights of XLM-RoBERTa into LayoutLMv3.

  1. How to do that?
  2. How do I also change the LayoutLMv3Tokenizer to the XLM-RoBERTa one?

Regarding question 2, would it be as simple as what you did in your notebook "CreateLiLT+_XLM_RoBERTa_base.ipynb" for a LiLT model with XLM-RoBERTa?

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
tokenizer.push_to_hub("nielsr/lilt-xlm-roberta-base")

Thank you.

NielsRogge commented 1 year ago

To create a LayoutXLM-large, you can copy-paste the weights from xlm-roberta-large into the model. You can do that by converting the weights as done in any conversion script, like this one. You basically need to load the original state_dict and then manipulate the dictionary so that the keys match the parameter names of the model you'd like to load the weights into.
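
Something like the following could serve as a starting point (a rough sketch, not an official conversion script; the config values and the "copy whatever matches by name and shape" heuristic are assumptions on my side and may need adjusting):

from transformers import XLMRobertaModel, LayoutLMv3Config, LayoutLMv3Model

# text encoder whose weights we want to reuse
xlmr = XLMRobertaModel.from_pretrained("xlm-roberta-large")

# a LayoutLMv3-large-sized config that uses XLM-R's vocabulary
config = LayoutLMv3Config(
    vocab_size=xlmr.config.vocab_size,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
)
model = LayoutLMv3Model(config)

# copy every parameter whose name and shape match (word/position embeddings,
# self-attention and feed-forward weights); layout, patch and relative-position
# parameters keep their random initialization
state_dict = model.state_dict()
for name, param in xlmr.state_dict().items():
    if name in state_dict and state_dict[name].shape == param.shape:
        state_dict[name] = param
model.load_state_dict(state_dict)

model.save_pretrained("layoutxlm-large-like")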

To use a tokenizer, you can use the LayoutXLMTokenizer (and LayoutXLMProcessor), which are already available in 🤗 Transformers.
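
For example (assuming the microsoft/layoutxlm-base checkpoint, which hosts the tokenizer files):

from transformers import LayoutXLMProcessor

processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")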

I don't think it's possible to leverage LiLT for large models, as LiLT only comes in a base-sized version.

philmas commented 9 months ago

I came across this issue while looking for an answer on whether it is possible to use LayoutLMv3 for other languages. In my case, I am interested in Dutch. What is currently the best way to proceed, given that I have a dataset in Dutch which I can use for training?

NielsRogge commented 9 months ago

@philmas the best way is to combine LiLT with a Dutch roberta-base model.

For Dutch, the best roberta-base model is currently this one: https://huggingface.co/DTAI-KULeuven/robbert-2023-dutch-base.

Hence please follow this guide to combine it with LiLT: https://github.com/jpWang/LiLT?tab=readme-ov-file#or-generate-your-own-checkpoint-optional. LiLT is a lightweight module which can be combined with any RoBERTa-base model, giving you a LayoutLM-like model for any language for which there's a RoBERTa-base model available. Refer to my notebooks regarding fine-tuning LiLT.
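
In 🤗 Transformers terms, the combination could look roughly like this (a minimal sketch, not the generation script from the guide above; the SCUT-DLVCLab/lilt-roberta-en-base checkpoint and the name/shape-matching heuristic are assumptions on my side):

from transformers import AutoModel, LiltConfig, LiltModel

# text model that provides the Dutch language knowledge
robbert = AutoModel.from_pretrained("DTAI-KULeuven/robbert-2023-dutch-base")

# existing LiLT checkpoint that provides the pretrained layout stream
lilt = LiltModel.from_pretrained("SCUT-DLVCLab/lilt-roberta-en-base")

# fresh LiLT model whose text side is sized for RobBERT's vocabulary
config = LiltConfig.from_pretrained(
    "SCUT-DLVCLab/lilt-roberta-en-base", vocab_size=robbert.config.vocab_size
)
model = LiltModel(config)

state_dict = model.state_dict()
# keep LiLT's pretrained layout-stream parameters
for name, param in lilt.state_dict().items():
    if "layout" in name and name in state_dict and state_dict[name].shape == param.shape:
        state_dict[name] = param
# fill the text stream with RobBERT's weights wherever names and shapes match
for name, param in robbert.state_dict().items():
    if name in state_dict and state_dict[name].shape == param.shape:
        state_dict[name] = param
model.load_state_dict(state_dict)

model.save_pretrained("lilt-robbert-2023-dutch-base")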

philmas commented 9 months ago

@NielsRogge Thank you!

I have created the model, and it can be found here.

Now I have to proceed with training/finetuning it. Do you know by any chance what dataset sizes are adequate? Are we talking 100s, or 1000s? Perhaps even more? I will be using it as a token classifier.

NielsRogge commented 9 months ago

Usually I recommend starting with around 100 examples, but as always with deep learning, the more the better.

LuckaGianvechio commented 5 months ago

Hello @NielsRogge, I am currently trying to reproduce the pretraining of a LayoutLMv3 model on a Brazilian Portuguese dataset.

Is there a way to initialize the model's weights with another BERT-based model called Bertimbau (since it isn't a RoBERTa-base model)? I have currently initialized only the word embeddings layer with Bertimbau's, and reimplemented the tokenization to work with Bertimbau's token IDs, but I think I could achieve better results by initializing more parameters from it.
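
For reference, the word-embedding initialization could look roughly like this (a minimal sketch; the neuralmind/bert-base-portuguese-cased checkpoint name and the attribute paths are assumptions):

import torch
from transformers import AutoModel, LayoutLMv3Config, LayoutLMv3Model

bertimbau = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased")

# LayoutLMv3 config sized for Bertimbau's vocabulary (hidden size is 768 on both sides)
config = LayoutLMv3Config(vocab_size=bertimbau.config.vocab_size)
model = LayoutLMv3Model(config)

# copy only the word embedding matrix; everything else stays randomly initialized
with torch.no_grad():
    model.embeddings.word_embeddings.weight.copy_(
        bertimbau.embeddings.word_embeddings.weight
    )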

Also, I haven't found where to provide the model with information about which image patches to mask (for the MIM task) in the Hugging Face Transformers modeling_layoutlmv3.py code. Do I need to implement this by hand in the LayoutLMv3Model class, or is there something already implemented?

Thanks in advance.