LayoutLM class imbalance issue

PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

Apache License 2.0

Hi. I'm working on a key information extraction problem using LayoutLM. I am facing overfitting because of class imbalance. The dataset is labeled such that in a single document, which in some cases has about 200 tokens, one segment is labeled TITLE, another ID, another PAGE_NUM, etc., while all the remaining segments have the class OTHER.
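To make the imbalance concrete, here is a minimal sketch of the label distribution for one such document. The counts are illustrative assumptions (one segment per rare class, the rest OTHER), not numbers from the real dataset, and the inverse-frequency weights are only one common heuristic for a weighted loss, not something validated here.

```python
from collections import Counter

# Hypothetical labels for one document: 200 segments, almost all OTHER.
# These counts are illustrative, not taken from the real dataset.
labels = ["TITLE", "ID", "PAGE_NUM"] + ["OTHER"] * 197

counts = Counter(labels)
total = len(labels)

# Share of each class among the labeled segments.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: total / (num_classes * count).
# Rare classes get large weights; OTHER gets a weight below 1.
num_classes = len(counts)
class_weights = {label: total / (num_classes * n) for label, n in counts.items()}

for label in counts:
    print(
        f"{label:9s} count={counts[label]:3d} "
        f"share={shares[label]:.3f} weight={class_weights[label]:.2f}"
    )
```

With a distribution like this, a model that predicts OTHER for every segment is already right on almost all of them, which is why plain accuracy says little here.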

Any recommendations for dealing with this problem? I'm currently using LayoutLM from HuggingFace, but I'm considering moving to Paddle, so I'd like to know how this can be tackled.

Thank you

LayoutLM class imbalance issue #9651