NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

LayoutLMv2 - Merging word-boxes that should be together but are separated by OCR #79

Open aripo99 opened 2 years ago

aripo99 commented 2 years ago

Hi,

Firstly, thank you for all the great notebooks, they have been an amazing source of learning!

Sometimes, when using OCR, fields get split into multiple word boxes even though they really correspond to a single label spanning a sequence of words and one bigger box. For example, when labeling addresses we could get a situation where "94120 State St" is split into 3 words/boxes. Using LayoutLMv2 for token classification, we may get the right label for each of these words, but they still correspond to different boxes. My question is whether there are any good approaches to joining all these labeled words into a single labeled box. One possible solution would be to do this manually (when nearby words get the same label, just assume they form one big word box; see the rough sketch below), but maybe there's a better solution? Do you have any input on this?
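
Something like this is what I meant by the manual approach (just a rough sketch; the word/box/label lists are illustrative):

```python
def merge_same_label_boxes(words, boxes, labels):
    """Merge consecutive words that share the same predicted label into one entity
    whose box is the union of the individual word boxes ([x0, y0, x1, y1])."""
    merged = []
    for word, box, label in zip(words, boxes, labels):
        if merged and merged[-1]["label"] == label:
            prev = merged[-1]
            prev["text"] += " " + word
            prev["box"] = [
                min(prev["box"][0], box[0]),
                min(prev["box"][1], box[1]),
                max(prev["box"][2], box[2]),
                max(prev["box"][3], box[3]),
            ]
        else:
            merged.append({"text": word, "box": list(box), "label": label})
    return merged

words = ["94120", "State", "St", "Total:"]
boxes = [[10, 10, 60, 30], [65, 10, 110, 30], [115, 10, 140, 30], [10, 50, 70, 70]]
labels = ["ADDRESS", "ADDRESS", "ADDRESS", "OTHER"]
print(merge_same_label_boxes(words, boxes, labels))
# -> [{'text': '94120 State St', 'box': [10, 10, 140, 30], 'label': 'ADDRESS'},
#     {'text': 'Total:', 'box': [10, 50, 70, 70], 'label': 'OTHER'}]
```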

Thank you so much :)

jyotiyadav94 commented 2 years ago

@aripo99 did you find any solution for this?

NielsRogge commented 2 years ago

Hi,

My question is whether there are any good approaches to joining all these labeled words into a single labeled box.

This is a very relevant question indeed! It turns out that performance improves greatly if you use so-called segment position embeddings (in the example you gave, "94120 State St" would be a single segment) rather than word position embeddings.

The authors of LayoutLMv3 mention this in the paper:

Note that LayoutLMv3 and StructuralLM use segment-level layout positions, while the other works (LayoutLM, LayoutLMv2) use word-level layout positions. The use of segment-level positions may benefit the semantic entity labeling task on FUNSD [25], so the two types of work are not directly comparable.

The authors of LayoutLMv3 got an F1 score of 92% on FUNSD with the large-sized variant thanks to using segment position embeddings, whereas its predecessor used word-level position embeddings (and scored much lower).

So the TL;DR is that, in case you have segments, it's always better to use the same position embeddings (or, in other words, the same bounding box coordinates) for all words within a segment, rather than a separate position embedding (bounding box coordinate) for every individual word.

You can see how the authors of LayoutLMv3 created segment coordinates for FUNSD here: https://huggingface.co/datasets/nielsr/funsd-layoutlmv3/blob/main/funsd-layoutlmv3.py#L140
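
In a nutshell, what that script does is take the union of the OCR word boxes within a segment and assign that single box to every word of the segment. A minimal sketch of the idea (the helper and data layout below are illustrative, not the exact code of the script):

```python
def union_box(word_boxes):
    # the segment box is the smallest box enclosing all word boxes [x0, y0, x1, y1]
    return [
        min(b[0] for b in word_boxes),
        min(b[1] for b in word_boxes),
        max(b[2] for b in word_boxes),
        max(b[3] for b in word_boxes),
    ]

# toy example: one segment = "94120 State St" with its 3 OCR word boxes
segments = [
    {"words": ["94120", "State", "St"],
     "word_boxes": [[10, 10, 60, 30], [65, 10, 110, 30], [115, 10, 140, 30]]},
]

words, boxes = [], []
for segment in segments:
    segment_box = union_box(segment["word_boxes"])
    for word in segment["words"]:
        words.append(word)
        boxes.append(segment_box)  # every word shares the segment-level box

# `words` and `boxes` can then be passed to the processor/tokenizer (with apply_ocr=False),
# so that all words of a segment get identical position embeddings
```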

jyotiyadav94 commented 2 years ago

@NielsRogge

Thanks for the detailed solution. Is there any Google Colab implementation for LayoutLMv3?

NielsRogge commented 2 years ago

It can be found here: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3

jyotiyadav94 commented 2 years ago

@NielsRogge

Thanks a lot, this is really a great effort; I appreciate your work and everything you have been doing for the open community. I just have one last question: are there any guidelines somewhere for saving the predicted output in the form of key-value pairs? I was able to find this for the LayoutLMv1 and LayoutLMv2 models, but I couldn't find many resources for v3.

jyotiyadav94 commented 2 years ago

@NielsRogge ,

How can we get the predicted output for LayoutLMv3? Is there any implementation already defined? I have been stuck on this for a very long time; if you have any ideas it would be great.

[image]

Can I get this output from inference?
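
For reference, this is roughly what I am trying for inference (just a sketch; the fine-tuned checkpoint path and the image file are placeholders):

```python
import torch
from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

# placeholders: your own fine-tuned token classification checkpoint and a document image
checkpoint = "path/to/your-finetuned-layoutlmv3"
image = Image.open("document.png").convert("RGB")

# apply_ocr=True (the default) runs Tesseract to extract words and boxes from the image
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3ForTokenClassification.from_pretrained(checkpoint)

encoding = processor(image, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = model(**encoding)

predictions = outputs.logits.argmax(-1).squeeze().tolist()
word_ids = encoding.word_ids(0)

# keep one prediction per word (skip special tokens and repeated subword tokens)
word_labels = {}
for idx, word_id in enumerate(word_ids):
    if word_id is not None and word_id not in word_labels:
        word_labels[word_id] = model.config.id2label[predictions[idx]]

print(word_labels)  # {word index: predicted label}; consecutive words with the same
                    # label can then be grouped into key-value pairs / entities
```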