NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

LayoutLM reading order for token classification #156

Open AleRosae opened 2 years ago

AleRosae commented 2 years ago

Hi, thank you very much for all the work that you have done, it is a huge help :) I have noticed that the way in which text is preprocessed for LayoutLMv1 (and I assume also for later versions) does not take the reading order into account. For instance, for the example image shown at the beginning of this notebook, the output in train.txt is:

R&D O
:   S-QUESTION
Suggestion: S-QUESTION
Date:   S-QUESTION
Licensee    S-ANSWER
Yes S-QUESTION
No  S-QUESTION
597005708   O
R&D B-HEADER
QUALITY I-HEADER
IMPROVEMENT I-HEADER
SUGGESTION/ I-HEADER
SOLUTION    I-HEADER
FORM    E-HEADER
[..etc]

but what I assume to be the correct reading order would be something like:

R&D B-HEADER
QUALITY I-HEADER
IMPROVEMENT I-HEADER
SUGGESTION/ I-HEADER
SOLUTION    I-HEADER
FORM    E-HEADER
NAME B-QUESTION
PHONE I-QUESTION
EXT E-QUESTION
M. B-ANSWER
HAMANN E-ANSWER
[...etc]

Is this irrelevant when training LayoutLM for token classification tasks? When we create a custom dataset, should we insert the text-label pairs following the reading order provided by OCR, or does it not matter?
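For anyone preparing a custom dataset along these lines, here is a minimal sketch of sorting OCR output into top-to-bottom, left-to-right reading order before writing out the text-label pairs. It assumes `words`, `boxes`, and `labels` are parallel lists with each box as `(x0, y0, x1, y1)`; the helper name and the `line_tol` tolerance are illustrative, not part of any notebook in this repo:

```python
def sort_reading_order(words, boxes, labels, line_tol=10):
    """Group words into lines by their top edge (within `line_tol` pixels),
    then sort each line left to right."""
    if not words:
        return words, boxes, labels
    items = sorted(zip(words, boxes, labels), key=lambda it: it[1][1])  # by y0
    lines, current = [], [items[0]]
    for item in items[1:]:
        if abs(item[1][1] - current[-1][1][1]) > line_tol:
            lines.append(current)  # top edge jumped: start a new line
            current = [item]
        else:
            current.append(item)
    lines.append(current)
    # within each line, sort left to right by x0
    ordered = [it for line in lines for it in sorted(line, key=lambda it: it[1][0])]
    w, b, l = zip(*ordered)
    return list(w), list(b), list(l)

words, boxes, labels = sort_reading_order(words, boxes, labels)
```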

YangjiaqiDig commented 1 year ago

@NielsRogge

> I have noticed that the way in which text is preprocessed for LayoutLMv1 (and I assume also for later versions) does not take the reading order into account. [...] When we create a custom dataset, should we insert the text-label pairs following the reading order provided by OCR, or does it not matter?

Did you figure it out? I have the same question and am very confused. @NielsRogge

AleRosae commented 1 year ago

Hi @YangjiaqiDig, I still don't have a clear answer to this issue, but I have some empirical evidence that the reading order might influence the final performance. I did some experiments with the Kleister-NDA dataset, and depending on the quality of the reading order I obtained different results. No idea if this is also the case with FUNSD.

Cheers, Alessandro

NielsRogge commented 1 year ago

The authors recommend providing words in reading order: https://github.com/microsoft/unilm/issues/85#issuecomment-600419220.

But LayoutLMv3 for instance obtains an F1 of 90% on FUNSD even though the words aren't in the correct reading order.

YangjiaqiDig commented 1 year ago

> The authors recommend providing words in reading order: microsoft/unilm#85 (comment).
>
> But LayoutLMv3 for instance obtains an F1 of 90% on FUNSD even though the words aren't in the correct reading order.

Thanks. LayoutLMv3 is tricky, though... it groups the tokens belonging to the same entity under the same bbox, and that's information leakage.

NielsRogge commented 1 year ago

> It groups the tokens belonging to the same entity under the same bbox, and that's information leakage.

Oh yes, fair point. So it's always advised to use segment position embeddings (you can obtain them, for instance, using the Microsoft Read API); this will definitely give you a nice boost in performance.
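For reference, a minimal sketch of what segment position embeddings mean in practice: every word in an OCR line/segment shares that segment's bounding box instead of its own word-level box. The `segments` structure below is hypothetical (a list of lines, each with its words and a line-level `bbox`, roughly the shape of what OCR services such as the Microsoft Read API return):

```python
# hypothetical OCR output: one dict per line/segment
# segments = [{"bbox": (x0, y0, x1, y1), "words": [{"text": ...}, ...]}, ...]
words, boxes = [], []
for segment in segments:
    for word in segment["words"]:
        words.append(word["text"])
        boxes.append(segment["bbox"])  # all words in a segment share its box

# the words/boxes can then be fed to the processor as usual
# (with the image processor's apply_ocr=False, since we supply our own boxes):
# encoding = processor(image, words, boxes=boxes, return_tensors="pt")
```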

YangjiaqiDig commented 1 year ago

> I have some empirical evidence that the reading order might influence the final performance. I did some experiments with the Kleister-NDA dataset, and depending on the quality of the reading order I obtained different results.

Thanks for the reply. I have done some experiments with different orders, and the performance does change. Two findings: 1) if we change the order dramatically (e.g. ordering by tag), it doesn't make sense, since inference has no label info and the token order can no longer match the image patch order, which makes it hard for the model to learn; 2) FUNSD's original annotations present the tokens of an entity consecutively as B-*, I-*, ..., I-*, whereas following the OCR reading order changes the input slightly, e.g. B-QUESTION, I-QUESTION, O, I-QUESTION. Still, the performance drops only slightly with the OCR order compared with the original FUNSD order. (A sketch of this kind of reordering experiment is below.)
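A small sketch of such a reordering ablation: permute the word order while keeping each word's box and label attached, then retrain/evaluate and compare against the original order. The function name and `seed` are illustrative; `words`, `boxes`, `labels` are assumed to be the parallel lists of one example:

```python
import random

def permute_example(words, boxes, labels, seed=0):
    """Shuffle the word order of one example, keeping box and label attached
    to their word (a 'dramatic' reorder; an OCR-order variant would instead
    sort by position, as in the earlier sketch)."""
    idx = list(range(len(words)))
    random.Random(seed).shuffle(idx)
    return ([words[i] for i in idx],
            [boxes[i] for i in idx],
            [labels[i] for i in idx])
```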

YangjiaqiDig commented 1 year ago

> So it's always advised to use segment position embeddings (you can obtain them, for instance, using the Microsoft Read API); this will definitely give you a nice boost in performance.

Thank you!! I will try that!