Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
Currently, to extract tables I'm using the code below, which saves them as an Excel file (section 2.2.4, table recognition).
import os
import cv2
import PIL
import paddleclas
import paddle
from paddleocr import PPStructure, draw_structure_result, save_structure_res

table_engine = PPStructure(layout=False, show_log=True)  # table recognition

save_folder = 'output'
img_path = 'example.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
To get the text from a PDF file I'm using the code below, but it also converts the tables into text, which I don't want.
from paddleocr import PaddleOCR, draw_ocr

# Paddleocr supports Chinese, English, French, German, Korean and Japanese.
# You can set the parameter `lang` as `ch`, `en`, `fr`, `german`, `korean`, `japan`
# to switch the language model in order.
ocr = PaddleOCR(use_angle_cls=True, lang="en", page_num=0)  # need to run only once to download and load model into memory
img_path = 'tables/example.pdf'
result = ocr.ocr(img_path, cls=True)

def ocr_to_txt(result):
    text = ""
    for page in result:
        if not page:
            continue  # a page with no detected text may come back as None
        for line in page:
            text += line[1][0] + " "  # each entry is [box, (text, confidence)]
        text += "\n"
    return text

text = ocr_to_txt(result)

with open("ocr_results.txt", "w", encoding="utf-8") as f:
    f.write(text)
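One possible way to keep tables out of the text output is to run layout analysis first and collect text only from the non-table regions. The sketch below is not a confirmed PaddleOCR recipe. It assumes each region is a dict with a `type` string and, for non-table regions, a `res` list of dicts carrying a `text` field, which is the shape recent PaddleOCR 2.x releases appear to return from `PPStructure`; this should be checked against the installed version. The helper `regions_to_text` and the `sample` data are hypothetical and only illustrate the filtering step.

```python
def regions_to_text(regions, skip_types=("table",)):
    """Join the text of layout regions, skipping unwanted region types.

    Assumed (unverified) region shape: {"type": str, "res": [{"text": str}, ...]}.
    Table regions are assumed to carry a dict (e.g. with "html") in "res".
    """
    lines = []
    for region in regions:
        if str(region.get("type", "")).lower() in skip_types:
            continue  # drop table regions entirely
        res = region.get("res") or []
        if not isinstance(res, list):
            continue  # unexpected payload shape, ignore it
        for item in res:
            text = item.get("text") if isinstance(item, dict) else None
            if text:
                lines.append(text)
    return "\n".join(lines)


# Mock regions standing in for the output of a layout-analysis call.
sample = [
    {"type": "title", "res": [{"text": "Quarterly report"}]},
    {"type": "table", "res": {"html": "<table>...</table>"}},
    {"type": "text", "res": [{"text": "Revenue grew."}, {"text": "Costs fell."}]},
]

print(regions_to_text(sample))
```

With PaddleOCR, the `regions` argument would presumably be the list returned by a `PPStructure` engine created with layout analysis left enabled (that is, without `layout=False`) and called on a page image; a PDF would first need to be rendered to images page by page.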