louisdeneve opened this issue 2 years ago
Hi,
Sadly, Microsoft didn't open-source any pre-training code. Regarding masked image modeling, I recommend checking out the run_mim.py script I added to Transformers, which lets you do masked image modeling on a custom dataset. It includes a MaskGenerator, which generates a random boolean mask indicating which patches to mask.
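Simplified, that generator looks roughly like this (paraphrased, with illustrative defaults; see run_mim.py for the exact version):

```python
import math
import numpy as np
import torch

class MaskGenerator:
    """Randomly masks a fraction of image patches (simplified from run_mim.py)."""

    def __init__(self, input_size=224, mask_patch_size=32, model_patch_size=16, mask_ratio=0.4):
        assert input_size % mask_patch_size == 0
        assert mask_patch_size % model_patch_size == 0
        # masking happens on a coarse grid of mask_patch_size blocks,
        # which is then upsampled to the model's patch resolution
        self.rand_size = input_size // mask_patch_size
        self.scale = mask_patch_size // model_patch_size
        self.token_count = self.rand_size ** 2
        self.mask_count = int(math.ceil(self.token_count * mask_ratio))

    def __call__(self):
        # choose mask_count random blocks to mask
        mask_idx = np.random.permutation(self.token_count)[: self.mask_count]
        mask = np.zeros(self.token_count, dtype=int)
        mask[mask_idx] = 1
        # upsample the block mask and return one flat boolean entry per patch
        mask = mask.reshape((self.rand_size, self.rand_size))
        mask = mask.repeat(self.scale, axis=0).repeat(self.scale, axis=1)
        return torch.tensor(mask.flatten(), dtype=torch.bool)
```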
Regarding word-patch alignment, that's a rather complex one:
> The WPA objective is to predict whether the corresponding image patches of a text word are masked. Specifically, we assign an aligned label to an unmasked text token when its corresponding image tokens are also unmasked. Otherwise, we assign an unaligned label. We exclude the masked text tokens when calculating WPA loss to prevent the model from learning a correspondence between masked text words and image patches.
Here, you would need to use the bounding box information of a word to know the correspondence with the image patches.
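A minimal sketch of that mapping (my own illustration, not code from the paper; it assumes LayoutLM-style boxes normalized to the 0-1000 range and a 224x224 image split into 16x16 patches, and the helper name is hypothetical):

```python
def word_to_patch_indices(box, image_size=224, patch_size=16, box_scale=1000):
    """Return the flattened indices of every image patch a word's box overlaps."""
    x0, y0, x1, y1 = box
    grid = image_size // patch_size  # e.g. 14 patches per side
    # convert normalized box coordinates to patch-grid coordinates
    col0 = int(x0 / box_scale * grid)
    col1 = min(int(x1 / box_scale * grid), grid - 1)
    row0 = int(y0 / box_scale * grid)
    row1 = min(int(y1 / box_scale * grid), grid - 1)
    return [r * grid + c for r in range(row0, row1 + 1) for c in range(col0, col1 + 1)]
```

A word whose box crosses patch boundaries simply maps to several patch indices.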
Hello,
I'm also having the same difficulties.
In particular for WPA, I don't fully understand how to establish the correspondence between a word/segment and the image patches. For example, if a word belongs to more than one patch after the image is split into patches, how do we define the WPA label?
Hi, as @NielsRogge said, Transformers has updated the code for masked image modeling, and that code is based on DeiT. You can inherit from it to implement masked image modeling for LayoutLMv3, and you can likewise inherit the code from RoBERTa to implement masked language modeling. For word-patch alignment, I'm still working on it. I've created an issue, feel free to join the discussion there.
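As a starting point for that discussion: one possible reading of the paper's definition is that a text token counts as aligned only when all of the patches its box covers are unmasked. An untested sketch (reusing the hypothetical word_to_patch_indices helper from the comment above):

```python
import torch

def wpa_labels(word_boxes, mlm_masked, patch_mask, ignore_index=-100):
    """1 = aligned (all covered patches unmasked), 0 = unaligned,
    ignore_index for MLM-masked text tokens (excluded from the WPA loss)."""
    labels = []
    for box, is_text_masked in zip(word_boxes, mlm_masked):
        if is_text_masked:
            labels.append(ignore_index)
            continue
        patches = word_to_patch_indices(box)  # hypothetical helper from above
        aligned = all(not patch_mask[p] for p in patches)
        labels.append(1 if aligned else 0)
    return torch.tensor(labels)
```

Here patch_mask would be the flattened boolean mask produced by the MIM mask generator.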
Currently I'm trying to adapt the tutorial code for LayoutLMv3 to my own local data, which is unlabeled. I want to do domain adaptation to improve the base model's performance, so basically I want to pre-train the model further on my own data before fine-tuning it. Currently my data has this structure:
I'm having trouble masking the data the way they do in the paper. To mask the text part I could use DataCollatorForLanguageModeling, but that only masks the text and constrains which data collator I can pass to the Trainer. I'm trying to figure out how to do masked image modeling (MIM) and word-patch alignment (WPA) in combination with the masked language modeling (MLM) described in the paper. Does anyone know how to do this, or how they implemented it?
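For reference, what I have in mind is a single custom collator that produces all three sets of targets on top of the processor output. A rough, untested sketch, reusing the MaskGenerator and wpa_labels sketches from the comments above (it assumes each feature is a dict from LayoutLMv3Processor, and the label key names are my own, not something an existing pre-training head expects):

```python
import torch
from transformers import DataCollatorForLanguageModeling

class LayoutLMv3PretrainCollator:
    """Rough sketch: combine MLM, MIM and WPA targets in one collator."""

    def __init__(self, tokenizer, mlm_probability=0.3):
        # reuse the stock collator's masking logic for the text side
        self.mlm_collator = DataCollatorForLanguageModeling(
            tokenizer, mlm=True, mlm_probability=mlm_probability
        )
        self.mask_generator = MaskGenerator()  # MIM sketch from above

    def __call__(self, features):
        # stack the tensors produced by LayoutLMv3Processor
        batch = {
            k: torch.stack([torch.as_tensor(f[k]) for f in features])
            for k in ("input_ids", "attention_mask", "bbox", "pixel_values")
        }
        # MLM: mask text tokens; labels are -100 at unmasked positions
        batch["input_ids"], batch["mlm_labels"] = self.mlm_collator.torch_mask_tokens(
            batch["input_ids"].clone()
        )
        # MIM: one flattened boolean patch mask per image
        batch["bool_masked_pos"] = torch.stack([self.mask_generator() for _ in features])
        # WPA: aligned/unaligned label per token, skipping MLM-masked tokens
        batch["wpa_labels"] = torch.stack([
            wpa_labels(
                batch["bbox"][i],
                batch["mlm_labels"][i] != -100,
                batch["bool_masked_pos"][i],
            )
            for i in range(len(features))
        ])
        return batch
```

A real version would also need to exclude special and padding tokens from the WPA labels, and the MIM and WPA heads and losses would still have to be written by hand, since Transformers doesn't ship pre-training heads for LayoutLMv3.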