PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
Apache License 2.0

PPStructure: unable to recognize a fairly easy structure #12036

Closed: vlavorini closed this issue 6 days ago

vlavorini commented 2 weeks ago

I am trying to parse this PDF using PaddleOCR 2.7.3.

I converted the pages to images and then ran PPStructure on them, trying each of the following options:

engine = PPStructure(show_log=True, image_orientation=True)

engine = PPStructure(show_log=True, image_orientation=True, lang='en')

engine = PPStructure(show_log=True, image_orientation=True, lang='en', layout_model_dir='./picodet_lcnet_x1_0_fgd_layout_infer', layout_dict_path='./layout_publaynet_dict.txt')

but the results for the second page of the document are not satisfactory (image: page_1).
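For anyone trying to reproduce this, the workflow described above (render each PDF page to an image, then run PPStructure on it) could look roughly like the sketch below. It is not the original poster's script. It assumes PyMuPDF (`fitz`) for rendering, which the post does not mention, and the helper names `check_layout_files`, `render_pages` and `analyze_pages`, the file name `document.pdf`, the output folder `pages`, and the 200 dpi setting are all hypothetical. The PPStructure keywords are the ones from the post, with the language passed as `lang`.

```python
from pathlib import Path


def check_layout_files(layout_model_dir, layout_dict_path):
    """Fail early if the layout model directory or label file is missing."""
    model_dir = Path(layout_model_dir)
    dict_path = Path(layout_dict_path)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"layout model directory not found: {model_dir}")
    if not dict_path.is_file():
        raise FileNotFoundError(f"layout dictionary not found: {dict_path}")
    return {"layout_model_dir": str(model_dir), "layout_dict_path": str(dict_path)}


def render_pages(pdf_path, out_dir, dpi=200):
    """Render every page of the PDF to a PNG file and return the file paths."""
    import fitz  # PyMuPDF: an assumption, the post does not name a renderer

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    zoom = dpi / 72  # PDF page coordinates use 72 units per inch
    paths = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            target = out / f"page_{index}.png"
            pix.save(str(target))
            paths.append(target)
    return paths


def analyze_pages(image_paths, **engine_kwargs):
    """Run PPStructure on each rendered page and collect the raw results."""
    import cv2
    from paddleocr import PPStructure

    engine = PPStructure(show_log=True, image_orientation=True, lang="en",
                         **engine_kwargs)
    results = {}
    for path in image_paths:
        image = cv2.imread(str(path))  # PPStructure is called on a BGR array
        results[str(path)] = engine(image)
    return results


# Hypothetical usage with the paths from the post:
# kwargs = check_layout_files("./picodet_lcnet_x1_0_fgd_layout_infer",
#                             "./layout_publaynet_dict.txt")
# results = analyze_pages(render_pages("document.pdf", "pages"), **kwargs)
```

The heavy imports sit inside the functions so that the path check can be run on its own before PaddleOCR is loaded. A missing model directory or dictionary file then surfaces as a plain `FileNotFoundError` rather than as a failure inside the predictor.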

I also tried with the model ppyolov2_r50vd_dcn_365e_publaynet:

engine = PPStructure(show_log=True, image_orientation=True, lang='en',
                     layout_model_dir='./ppyolov2_r50vd_dcn_365e_publaynet',
                     layout_dict_path='./layout_publaynet_dict.txt')

but the program stops with an error: InvalidArgumentError: The size of Op(Conv) inputs should not be 0.

Any suggestions on how to parse this PDF correctly?

Thank you!

TingquanGao commented 1 week ago

What is the PaddlePaddle version used?

vlavorini commented 1 week ago

PaddlePaddle == 2.6.1, PaddleOCR == 2.7.3
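A quick way to collect these versions for a bug report is to read the installed package metadata. The snippet below is a small sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the distribution names `paddlepaddle`, `paddlepaddle-gpu` and `paddleocr` are assumed to be the pip package names in use.

```python
from importlib import metadata


def installed_versions(packages=("paddlepaddle", "paddlepaddle-gpu", "paddleocr")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name} == {version if version is not None else 'not installed'}")
```

Only one of `paddlepaddle` and `paddlepaddle-gpu` is normally installed, so one of the two is expected to show up as not installed.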

TingquanGao commented 1 week ago

The dataset used to train the PPStructure models lacks this kind of data, so the models need to be fine-tuned.