PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

PP-Structure's Chinese text recognition results come out as Unicode escape sequences. #13720

Closed greatliu closed 1 week ago

greatliu commented 3 weeks ago

🔎 Search before asking

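🐛 Bug (Problem description)

With PaddleOCR 2.8.1, the Chinese text recognized by PP-Structure comes out as Unicode escape sequences, both in the saved res file and in the printed result.

I searched and found a similar issue from a while back, but it was never resolved. https://github.com/PaddlePaddle/PaddleOCR/issues/10790

🏃‍♂️ Environment

Windows 11, GTX 3080

🌰 Minimal Reproducible Example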

import os
import cv2
from paddleocr import PPStructure, save_structure_res
from paddle.utils import try_import
import numpy as np
from PIL import Image
from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx

# Chinese test image
table_engine = PPStructure(recovery=True)
# English test image
# table_engine = PPStructure(recovery=True, lang='en')

# Raw strings keep the Windows backslashes from being read as escapes
save_folder = r'E:\Workspace\GitHub\PaddleOCR-2.8.1\output'
img_path = r'E:\合同.pdf'

# Render every PDF page to a BGR image
fitz = try_import("fitz")
imgs = []
with fitz.open(img_path) as pdf:
    for pg in range(0, pdf.page_count):
        page = pdf[pg]
        mat = fitz.Matrix(2, 2)
        pm = page.get_pixmap(matrix=mat, alpha=False)

        # if width or height > 2000 pixels, don't enlarge the image
        if pm.width > 2000 or pm.height > 2000:
            pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)

        img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
        img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
        imgs.append(img)

# Run layout analysis on each page, then save and print the results
for index, img in enumerate(imgs):
    result = table_engine(img)
    save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0], index)
    for line in result:
        line.pop('img')
        print(line)
    h, w, _ = img.shape
    res = sorted_layout_boxes(result, w)

hot-vs-cool commented 2 weeks ago

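Find the paddleocr installation path in the Python environment you are using (for example: anaconda3/envs/paddle-env/lib/python3.12/site-packages/paddleocr/), then add ensure_ascii=False to the arguments of every json.dumps() call in ppstructure/predict_system.py. That solves the problem. In your code, json.dumps() is called from inside save_structure_res().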
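For reference, here is a minimal sketch of what that argument changes, using only the standard library. The `record` dict is a made-up stand-in, not the exact structure PP-Structure produces.

```python
import json

# Hypothetical record, loosely shaped like one recognized region.
record = {"type": "text", "res": [{"text": "合同"}]}

# Default behaviour (ensure_ascii=True): non-ASCII characters are
# written as \uXXXX escape sequences.
escaped = json.dumps(record)

# With ensure_ascii=False the characters are kept as-is.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# The two strings differ only in presentation; both decode to the same data.
assert json.loads(escaped) == json.loads(readable) == record
```

The escaped form is still valid JSON and loses no information, so files that were already saved can be decoded with json.loads().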

greatliu commented 2 weeks ago

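Thanks, I'll try it when I get a chance. For something like this that requires changing the source code, wouldn't it be more appropriate to expose it as an external parameter?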
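If editing the installed package is not an option, one possible workaround is to post-process the saved file instead. The sketch below assumes the res file holds one JSON object per line; that is an assumption about the output format, so check it against your own files first. The helper name `unescape_json_lines` is hypothetical and not part of PaddleOCR.

```python
import json
from pathlib import Path


def unescape_json_lines(path, encoding="utf-8"):
    """Rewrite a file of JSON lines in place with non-ASCII text unescaped.

    Assumes every non-blank line is a standalone JSON document.
    """
    path = Path(path)
    rewritten = []
    for line in path.read_text(encoding=encoding).splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Decode the escaped line, then re-encode it without ASCII escaping.
        rewritten.append(json.dumps(json.loads(line), ensure_ascii=False))
    path.write_text("\n".join(rewritten) + "\n", encoding=encoding)
    return rewritten
```

It would be called once per saved file, for example unescape_json_lines("output/some_doc/res_0.txt"), where the path is only a placeholder for wherever your res file ends up.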