Open · AleRosae opened 2 years ago
@NielsRogge
Hi, thank you very much for all the work you have done, it is a huge help :) I have noticed that the way text is preprocessed for LayoutLMv1 (and, I assume, also for later versions) does not take the reading order into account. For instance, for the example image shown at the beginning of this notebook, the output in
train.txt
is:
R&D O : S-QUESTION Suggestion: S-QUESTION Date: S-QUESTION Licensee S-ANSWER Yes S-QUESTION No S-QUESTION 597005708 O R&D B-HEADER QUALITY I-HEADER IMPROVEMENT I-HEADER SUGGESTION/ I-HEADER SOLUTION I-HEADER FORM E-HEADER [..etc]
but what I assume to be the correct reading order would be something like:
R&D B-HEADER QUALITY I-HEADER IMPROVEMENT I-HEADER SUGGESTION/ I-HEADER SOLUTION I-HEADER FORM E-HEADER NAME B-QUESTION PHONE I-QUESTION EXT E-QUESTION M. B-ANSWER HAMANN E-ANSWER [...etc]
Is this irrelevant when training LayoutLM for token classification tasks? When we create a custom dataset, should we insert the text-label pairs following the reading order provided by the OCR, or does it not matter?
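(For illustration, here is a minimal sketch, not from the thread, of imposing a rough top-to-bottom, left-to-right reading order on OCR output before building the word/label lists. The function name, the line tolerance, and the assumption that boxes are (x0, y0, x1, y1) in the 0-1000 scale LayoutLM uses are all made up for this example.)

```python
def sort_reading_order(words, boxes, labels, line_tolerance=10):
    """Hypothetical helper: group words into lines by their top coordinate,
    then sort each line left to right, to approximate natural reading order."""
    # Sort primarily by the top of the box, then by the left edge.
    items = sorted(zip(words, boxes, labels), key=lambda it: (it[1][1], it[1][0]))

    lines, current_line, current_top = [], [], None
    for word, box, label in items:
        top = box[1]
        if current_top is None or abs(top - current_top) <= line_tolerance:
            # Same (approximate) line: keep accumulating.
            current_line.append((word, box, label))
            if current_top is None:
                current_top = top
        else:
            # New line: flush the previous one, sorted left to right.
            lines.append(sorted(current_line, key=lambda it: it[1][0]))
            current_line, current_top = [(word, box, label)], top
    if current_line:
        lines.append(sorted(current_line, key=lambda it: it[1][0]))

    ordered = [it for line in lines for it in line]
    return ([w for w, _, _ in ordered],
            [b for _, b, _ in ordered],
            [l for _, _, l in ordered])
```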
Did you figure this out? I have the same question and am quite confused. @NielsRogge
Hi @YangjiaqiDig, I still don't have a clear answer to this issue, but I have some empirical evidence that the reading order might influence the final performance. I did some experiments with the Kleister-NDA dataset and, depending on the quality of the reading order, I obtained different results. No idea whether this is also the case with FUNSD.
Cheers, Alessandro
The authors recommend providing words in reading order: https://github.com/microsoft/unilm/issues/85#issuecomment-600419220.
But LayoutLMv3 for instance obtains an F1 of 90% on FUNSD even though the words aren't in the correct reading order.
Thanks. LayoutLMv3 is tricky, though... it groups the tokens belonging to the same entity under the same bbox, and that leaks label information.
Oh yes, fair point. Still, it's generally advised to use segment position embeddings (you can obtain the segments, for instance, with the Microsoft Read API); this will definitely give you a nice boost in performance.
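(For illustration, a minimal sketch, assuming that "segment position embeddings" here simply means every word in an OCR line/segment shares that segment's bounding box instead of its own word-level box. The segment contents, box coordinates, and image path below are made up; only the LayoutLMv3Processor usage with apply_ocr=False reflects the actual library API.)

```python
from PIL import Image
from transformers import LayoutLMv3Processor

# apply_ocr=False so we can supply our own words and boxes.
processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)

# Hypothetical OCR output: one entry per line/segment, with a single
# line-level box (0-1000 scale) instead of per-word boxes.
segments = [
    {"words": ["R&D", "QUALITY", "IMPROVEMENT"], "box": [85, 40, 590, 70]},
    {"words": ["NAME", "PHONE", "EXT"], "box": [60, 120, 340, 145]},
]

words, boxes = [], []
for seg in segments:
    for word in seg["words"]:
        words.append(word)
        boxes.append(seg["box"])  # every word repeats its segment's box

image = Image.open("form.png").convert("RGB")  # hypothetical file
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
```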
Thanks for the reply. I have done some experiments with different orders, and the performance does seem to change. I have two findings: 1) if we change the order dramatically (e.g. ordering by tag), it doesn't make sense, since at inference time there is no label info, and the token order can no longer match the image patch order, which makes it hard for the model to learn; 2) the FUNSD dataset presents the tokens of the same entity consecutively as B-*, I-*, ..., I-*, whereas following the OCR reading order changes the input slightly, for instance to B-QUESTION, I-QUESTION, O, I-QUESTION. Even so, performance only drops slightly with the OCR order compared with the original FUNSD order.
Thank you!! I will try that!