PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0
44.34k stars 7.83k forks

LayoutLM class imbalance issue #9651

Closed inesriahi closed 6 months ago

inesriahi commented 1 year ago

Hi. I'm working on a key information extraction problem using LayoutLM, and the model is overfitting because of class imbalance. The dataset is labeled such that, in a single document of about 200 tokens in some cases, one segment is labeled TITLE, another ID, another PAGE_NUM, etc., while all remaining segments have the class OTHER.

Any recommendations for dealing with this problem? I'm currently using LayoutLM from HuggingFace, but I'm considering moving to Paddle, so I'd like to know how this can be tackled.
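For context, one common mitigation for this kind of skew is inverse-frequency class weighting in the loss. Below is a minimal pure-Python sketch of the idea, not something taken from LayoutLM or PaddleOCR. The 197/1/1/1 label split and the helper names `inverse_frequency_weights` and `weighted_cross_entropy` are illustrative assumptions. In a real training loop, the resulting weights would typically be handed to the framework's own weighted cross-entropy loss instead.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count).

    Rare classes get large weights and the dominant class gets a small one.
    (Hypothetical helper, for illustration only.)
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights.

    probs is a list of dicts mapping class name -> predicted probability.
    (Hypothetical helper, for illustration only.)
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        weighted_sum += -w * math.log(p[y])
        weight_total += w
    return weighted_sum / weight_total


# Assumed toy document: 200 tokens, three of which carry a non-OTHER label.
labels = ["OTHER"] * 197 + ["TITLE", "ID", "PAGE_NUM"]
weights = inverse_frequency_weights(labels)

# A model that predicts a uniform distribution over the four classes.
uniform = {cls: 0.25 for cls in weights}
loss = weighted_cross_entropy([uniform] * len(labels), labels, weights)

print(weights)
print(loss)
```

With these weights, each class contributes the same total weight to the loss, so always predicting OTHER is no longer a cheap way to get a low loss. Whether this is enough in practice is untested here; resampling documents or using a focal-style loss are other options that could be tried.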

Thank you

UserWangZz commented 6 months ago

This issue has not been updated for a long time. This issue is temporarily closed and can be reopened if necessary.