PPStructure: unable to recognize a fairly easy structure

vlavorini commented 2 weeks ago

I am trying to parse this PDF using PaddleOCR 2.7.3.

I converted the pages to images and then ran PPStructure on them, trying the following options:

engine = PPStructure(show_log=True, image_orientation=True)

engine = PPStructure(show_log=True, image_orientation=True, lan='en')

engine = PPStructure(show_log=True, image_orientation=True, lan='en', layout_model_dir='./picodet_lcnet_x1_0_fgd_layout_infer', layout_dict_path='./layout_publaynet_dict.txt')

but the results on the second page of the document are not satisfactory (see image: page_1).
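For reference, the workflow described above (render each page to an image, run PPStructure on it, inspect what was detected) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the script actually used: PyMuPDF (`fitz`) is assumed for rendering, the helper names `render_pdf_pages`, `summarize_regions` and `analyze_pdf` are hypothetical, and each returned region is assumed to be a dict with a `type` key, as in PaddleOCR 2.7. The sketch passes the language as `lang`, the keyword PaddleOCR documents; a differently spelled keyword such as `lan` may be ignored without warning, so that is worth double-checking.

```python
from collections import Counter


def render_pdf_pages(pdf_path, dpi=200):
    """Yield each PDF page as a BGR numpy array (the channel order OpenCV uses)."""
    # Assumption: PyMuPDF is used for rendering; any PDF-to-image tool would do.
    import fitz
    import numpy as np

    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi, alpha=False)
            rgb = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
                pix.height, pix.width, pix.n
            )
            # Reverse the channel axis (RGB -> BGR) and copy into a contiguous array.
            yield np.ascontiguousarray(rgb[:, :, ::-1])


def summarize_regions(regions):
    """Count detected regions by layout type, e.g. {'text': 5, 'table': 1}."""
    return dict(Counter(region.get("type", "unknown") for region in regions))


def analyze_pdf(pdf_path):
    """Run PPStructure on every page and print a per-page summary of region types."""
    from paddleocr import PPStructure

    # Same options as the second attempt in this post, with the keyword spelled `lang`.
    engine = PPStructure(show_log=True, image_orientation=True, lang="en")
    for page_index, image in enumerate(render_pdf_pages(pdf_path)):
        regions = engine(image)
        print(page_index, summarize_regions(regions))
```

Comparing the per-page summaries makes it easier to tell whether the layout model misses regions entirely on the problematic page or detects them under the wrong type.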

I also tried with the model ppyolov2_r50vd_dcn_365e_publaynet:

engine = PPStructure(show_log=True, image_orientation=True, lan='en',
                     layout_model_dir='./ppyolov2_r50vd_dcn_365e_publaynet',
                     layout_dict_path='./layout_publaynet_dict.txt')

but the program stops with an error: InvalidArgumentError: The size of Op(Conv) inputs should not be 0.

Any suggestions on how to parse this PDF correctly?

Thank you!

TingquanGao commented 1 week ago

What is the PaddlePaddle version used?

vlavorini commented 1 week ago

PaddlePaddle == 2.6.1, PaddleOCR == 2.7.3

TingquanGao commented 1 week ago

The dataset used to train the PPStructure models lacks this kind of data, so the models need to be fine-tuned on such data.

PaddlePaddle / PaddleOCR

PPStructure: unable to recognize a fairly easy structure #12036