NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License

Why do LayoutLM models not work properly on unstructured text images, and is there a way to handle them properly? #136

Open deepanshudashora opened 2 years ago

deepanshudashora commented 2 years ago

I have experienced this with two custom datasets where the information was in paragraph format: LayoutLM models did not give me good results on either, since the documents were unstructured.

@NielsRogge Can you please explain the reason and suggest a better way to train, so that I can get results on unstructured datasets similar to those on FUNSD or CORD?

nasheedyasin commented 2 years ago

Here's my attempt at a reason:

Unlike typical NER models that are trained on the entire text of the document (limited to 512 tokens, of course), the LayoutLMv2 model forms context only from the patch that you have annotated (most probably fewer than 512 tokens). The result is that this works well for forms and other structured documents, where rich, visually distinctive features help the model identify your entities of interest. With unstructured prose, the minimal visually distinctive features and smaller contexts mean the model doesn't converge as you'd expect.

Maybe for your use case, a simple Token or Span classification model would do?
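To make the token-classification suggestion concrete: a model like `AutoModelForTokenClassification` emits one BIO label per token, and the entity spans are recovered by grouping those labels afterwards. Here is a minimal, self-contained sketch of that decoding step; the tokens, labels, and entity types are illustrative, not from a real model.

```python
# Decode BIO tags (the usual output of a token-classification head)
# into (entity_type, text) spans. Illustrative example data only.

def bio_to_spans(tokens, labels):
    """Group (token, BIO-label) pairs into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A "B-" tag starts a new span, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An "I-" tag of the same type continues the open span.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent "I-" tag) ends the open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Invoice", "from", "Acme", "Corp", "dated", "12", "March"]
labels = ["O", "O", "B-ORG", "I-ORG", "O", "B-DATE", "I-DATE"]
print(bio_to_spans(tokens, labels))
# → [('ORG', 'Acme Corp'), ('DATE', '12 March')]
```

Since this head only consumes the text (no bounding boxes or image patches), it sees the full 512-token context of the prose, which is exactly what LayoutLMv2's patch-local context lacks on unstructured documents.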