PPStructure misses text that PaddleOCR does not miss

omeruth commented 1 month ago

Problem Description

Runtime Environment

OS:
Paddle:
PaddleOCR:

Reproduction Code

from paddleocr import PaddleOCR, PPStructure

ocr = PaddleOCR(lang='en', use_angle_cls=True, use_gpu=True)

table_engine = PPStructure(show_log=True, image_orientation=True, structure_version='PP-StructureV2', recovery=True)

Complete Error Message

PaddleOCR's output is good, but it messes up tables. So I tried PPStructure, which gives very good results for tables as well. However, I noticed that it tends to miss certain parts of a document completely, in places where PaddleOCR works fine. Is it possible to use PaddleOCR to extract text and tables separately, without using PPStructure? Or is there a way to keep PPStructure from missing text that PaddleOCR finds? Thanks.

Possible solutions
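
One direction that might be worth trying, offered only as a rough sketch and not as a confirmed fix: run both engines on the same page and keep any PaddleOCR line that does not fall inside a region PPStructure reported. The helper names below (`quad_to_box`, `covered_fraction`, `find_missed_lines`) and the `min_cover` threshold are hypothetical. The sketch assumes each PaddleOCR line has the shape `[four_corner_points, (text, score)]` and each PPStructure region is a dict with a `bbox` of `[x1, y1, x2, y2]`. It only catches text that lies outside every detected layout region, not text dropped inside a region.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def quad_to_box(quad: Sequence[Sequence[float]]) -> Box:
    """Convert a 4-point polygon to an axis-aligned box."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (min(xs), min(ys), max(xs), max(ys))


def covered_fraction(inner: Box, outer: Box) -> float:
    """Fraction of the area of `inner` that lies inside `outer`."""
    w = min(inner[2], outer[2]) - max(inner[0], outer[0])
    h = min(inner[3], outer[3]) - max(inner[1], outer[1])
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def find_missed_lines(
    ocr_lines: Sequence,
    layout_regions: Sequence[Dict],
    min_cover: float = 0.5,  # assumed threshold, tune per document set
) -> List[Dict]:
    """Return OCR lines that no layout region covers by at least `min_cover`."""
    region_boxes = [tuple(r["bbox"]) for r in layout_regions]
    missed = []
    for quad, (text, score) in ocr_lines:
        box = quad_to_box(quad)
        if all(covered_fraction(box, rb) < min_cover for rb in region_boxes):
            missed.append({"text": text, "score": score, "bbox": box})
    return missed
```

With the engines from the reproduction code, `ocr_lines` would presumably come from `ocr.ocr(img_path, cls=True)[0]` and `layout_regions` from calling the PPStructure object on the image loaded with `cv2.imread`. The lines returned by `find_missed_lines` could then be appended to the PPStructure output so that tables still come from PPStructure while plain text falls back to PaddleOCR.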

Appendix

GreatV commented 1 month ago

What are your runtime environment details, such as paddle and paddleocr versions?

omeruth commented 1 month ago

import paddle
print(paddle.__version__)
2.6.1
import paddleocr
print(paddleocr.__version__)
2.8.0

python --version
Python 3.10.13

Ubuntu 20.04.6 LTS

GreatV commented 1 month ago

Hi @omeruth, could you provide an example image so I can replicate the result?

omeruth commented 1 month ago

I am running this on medical records, so I can't share them. I can tell you that PaddleOCR works fine but PPStructure misses text in the same document, and I have observed the same thing on some other documents as well. Medical records can contain tables, text, one-column and two-column layouts, images, logos, etc. These are scanned documents stored as images.
