PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Can I get tables and text from a PDF separately with PaddleOCR? #11959

Closed by jay22mehta 3 months ago

jay22mehta commented 4 months ago

Currently, to get tables as an Excel file, I'm using this code (from section 2.2.4, table recognition):

    import os
    import cv2
    import PIL
    import paddleclas
    import paddle
    from paddleocr import PPStructure, draw_structure_result, save_structure_res

    table_engine = PPStructure(layout=False, show_log=True)  # table recognition

    save_folder = 'output'
    img_path = 'example.png'
    img = cv2.imread(img_path)
    result = table_engine(img)
    save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

To get text from a PDF file I'm using the code below, but it also converts the tables into text, which I don't want.

    from paddleocr import PaddleOCR, draw_ocr

    # Paddleocr supports Chinese, English, French, German, Korean and Japanese.
    # You can set the parameter `lang` as `ch`, `en`, `fr`, `german`, `korean`, `japan`
    # to switch the language model in order.
    ocr = PaddleOCR(use_angle_cls=True, lang="en", page_num=0)  # need to run only once to download and load model into memory
    img_path = 'tables/example.pdf'
    result = ocr.ocr(img_path, cls=True)

    def ocr_to_txt(result):
        text = ""
        for line in result:
            for word in line:
                text += word[1][0] + " "
            text += "\n"
        return text

    text = ocr_to_txt(result)

    with open("ocr_results.txt", "w") as f:
        f.write(text)

Sunting78 commented 4 months ago

You can use layout analysis to tell text regions from table regions. Refer to https://github.com/PaddlePaddle/PaddleOCR/blob/2b3b3554c05ae615ed7eb051c2ac7c6bb8bc985d/ppstructure/README.md
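A rough sketch of that suggestion, in case it helps. It assumes that `PPStructure` with layout analysis left on returns a list of region dicts, where table regions carry `res['html']` and the other regions carry a list of `{'text': ...}` entries. That shape matches recent PaddleOCR 2.x releases as far as I know, but it may differ in other versions, so check one result by hand first. `split_regions` and `analyze_image` are made-up helper names, not part of PaddleOCR.

```python
def split_regions(regions):
    """Split PPStructure-style layout results into table HTML and plain text.

    Assumed region shape (verify against your PaddleOCR version):
      {'type': 'table', 'res': {'html': '<table>...</table>', ...}}
      {'type': 'text',  'res': [{'text': '...', 'confidence': 0.98, ...}, ...]}
    """
    tables = []
    text_lines = []
    for region in regions:
        kind = str(region.get("type", "")).lower()
        res = region.get("res")
        if kind == "table":
            # Table regions: keep the recognized HTML, skip the OCR text.
            if isinstance(res, dict) and "html" in res:
                tables.append(res["html"])
        elif isinstance(res, (list, tuple)):
            # Every other region type is treated as ordinary text here.
            for item in res:
                if isinstance(item, dict) and "text" in item:
                    text_lines.append(item["text"])
    return tables, "\n".join(text_lines)


def analyze_image(img_path):
    """Hypothetical wrapper: run layout analysis on one page image."""
    # Imported lazily so split_regions can be used without PaddleOCR installed.
    import cv2
    from paddleocr import PPStructure

    engine = PPStructure(show_log=False)  # layout analysis is on by default
    return split_regions(engine(cv2.imread(img_path)))
```

With this, `tables, text = analyze_image('example.png')` would give the table HTML and the non-table text separately. The sketch takes a page image, so a PDF would first need each page rendered to an image.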