PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0
42.4k stars 7.65k forks

PPStructure missing text that PaddleOCR do not miss #13393

Open omeruth opened 1 month ago

omeruth commented 1 month ago

Problem Description

PPStructure misses text that PaddleOCR does not miss.

Runtime Environment

Reproduction Code

from paddleocr import PaddleOCR, PPStructure

ocr = PaddleOCR(lang='en', use_angle_cls=True, use_gpu=True)

table_engine = PPStructure(show_log=True, image_orientation=True, structure_version='PP-StructureV2', recovery=True)

Complete Error Message

PaddleOCR's output is good for plain text, but it handles tables poorly. I therefore tried PPStructure, which gives very good results for tables. However, I noticed that it tends to miss certain parts of a document completely, parts that PaddleOCR reads fine. Is it possible to use PaddleOCR to extract text and tables separately, without PPStructure? Or is there a way to keep PPStructure from missing text that PaddleOCR finds? Thanks.

Possible solutions
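One way to approach the question above, offered only as a sketch and not as a confirmed PaddleOCR feature, is to run both engines on the same page and keep every PaddleOCR line that no PPStructure region covers. The helper names (`quad_to_box`, `covered_fraction`, `find_missed_lines`) and the `min_cover` threshold are hypothetical. The input shapes are assumptions about the 2.x outputs: PPStructure regions as dicts with a `bbox` of `[x1, y1, x2, y2]`, and PaddleOCR lines as `[quad, (text, score)]` with `quad` holding four `[x, y]` points.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def quad_to_box(quad: Sequence[Sequence[float]]) -> Box:
    """Convert a 4-point text quadrilateral to an axis-aligned box."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (min(xs), min(ys), max(xs), max(ys))


def covered_fraction(inner: Box, outer: Box) -> float:
    """Fraction of `inner`'s area that lies inside `outer`."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0


def find_missed_lines(ocr_lines: List, regions: List[Dict],
                      min_cover: float = 0.5) -> List[Dict]:
    """Return OCR lines that are not covered by any layout region."""
    missed = []
    for quad, (text, score) in ocr_lines:
        box = quad_to_box(quad)
        covered = any(
            covered_fraction(box, tuple(r["bbox"])) >= min_cover
            for r in regions
        )
        if not covered:
            missed.append({"bbox": box, "text": text, "score": score})
    return missed


# Synthetic data in the assumed shapes (not real engine output).
demo_regions = [{"type": "table", "bbox": [0, 0, 100, 50]}]
demo_lines = [
    [[[10, 10], [90, 10], [90, 20], [10, 20]], ("inside table", 0.99)],
    [[[10, 80], [90, 80], [90, 95], [10, 95]], ("footer note", 0.95)],
]
missed = find_missed_lines(demo_lines, demo_regions)
for line in missed:
    print(line["text"])
```

In a real run, `ocr_lines` would presumably come from the first element of the `PaddleOCR(...).ocr(...)` result and `regions` from calling the `PPStructure(...)` engine on the same image; both should be checked against the installed version before relying on this. The recovered lines can then be merged with the PPStructure regions by position.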

Appendix

GreatV commented 1 month ago

What are your runtime environment details, such as paddle and paddleocr versions?

omeruth commented 1 month ago

import paddle
print(paddle.__version__)  # 2.6.1
import paddleocr
print(paddleocr.__version__)  # 2.8.0

python --version
Python 3.10.13

Ubuntu 20.04.6 LTS

GreatV commented 1 month ago

Hi @omeruth, could you provide an example image so I can replicate the result?

omeruth commented 1 month ago

I am running this on medical records, so I can't share them. I can tell you that PaddleOCR works fine, but PPStructure misses text in the same document, and I have observed the same on other documents as well. Medical records can contain tables, text, one-column and two-column layouts, images, logos, etc. These are scanned documents stored as images.