greatliu commented 3 months ago

🔎 Search before asking

[X] I have searched the PaddleOCR Docs and found no similar bug report.
[X] I have searched the PaddleOCR Issues and found no similar bug report.
[X] I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (Problem Description)

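With ppocr 2.8.1, the Chinese text recognized by ppstructure comes out as Unicode escape sequences (both in the saved res file and in the printed result).

I searched and found that similar problems were reported a while ago, but none of them were resolved. https://github.com/PaddlePaddle/PaddleOCR/issues/10790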

🏃‍♂️ Environment

Windows 11, GTX 3080

🌰 Minimal Reproducible Example

    import os
    import cv2
    from paddleocr import PPStructure, save_structure_res
    from paddle.utils import try_import
    import numpy as np
    from PIL import Image
    from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx

    # Chinese test image
    table_engine = PPStructure(recovery=True)
    # English test image
    # table_engine = PPStructure(recovery=True, lang='en')

    # Raw strings keep the Windows backslashes from being read as escapes
    save_folder = r'E:\Workspace\GitHub\PaddleOCR-2.8.1\output'
    img_path = r'E:\合同.pdf'

    # Render every PDF page to a BGR image
    fitz = try_import("fitz")
    imgs = []
    with fitz.open(img_path) as pdf:
        for pg in range(0, pdf.page_count):
            page = pdf[pg]
            mat = fitz.Matrix(2, 2)
            pm = page.get_pixmap(matrix=mat, alpha=False)

            # if width or height > 2000 pixels, don't enlarge the image
            if pm.width > 2000 or pm.height > 2000:
                pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)

            img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
            img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
            imgs.append(img)

    # Run layout analysis page by page and save the results
    for index, img in enumerate(imgs):
        result = table_engine(img)
        save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0], index)
        for line in result:
            line.pop('img')
            print(line)
        h, w, _ = img.shape
        res = sorted_layout_boxes(result, w)

hot-vs-cool commented 2 months ago

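Find the paddleocr installation path in the Python environment you are using (for example: anaconda3/envs/paddle-env/lib/python3.12/site-packages/paddleocr/). In ppstructure/predict_system.py, add ensure_ascii=False to the arguments of every json.dumps() call, and the problem is solved. In your code, it is save_structure_res() that calls json.dumps().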
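For anyone who wants to see the effect of that flag in isolation, here is a minimal sketch using only the standard library. The `region` dict is made-up sample data, only loosely modelled on a PPStructure result entry. `reencode_json_line` is a hypothetical helper, not part of PaddleOCR, and it assumes each line of an already-saved res file is one standalone JSON object.

```python
import json

# Made-up sample data, loosely modelled on one PPStructure result entry.
region = {"type": "text", "res": [{"text": "合同"}]}

# Default behaviour (ensure_ascii=True): non-ASCII characters are escaped.
escaped = json.dumps(region)
# With ensure_ascii=False the Chinese characters are written as-is.
readable = json.dumps(region, ensure_ascii=False)

print(escaped)   # {"type": "text", "res": [{"text": "\u5408\u540c"}]}
print(readable)  # {"type": "text", "res": [{"text": "合同"}]}


def reencode_json_line(line: str) -> str:
    """Hypothetical helper: re-encode one saved JSON line in readable form.

    Assumes the line is a single standalone JSON object. The escaped and
    readable forms decode to the same data, so nothing is lost.
    """
    return json.dumps(json.loads(line), ensure_ascii=False)


print(reencode_json_line(escaped) == readable)  # True
```

The escaped output is still valid JSON and decodes back to the same text, so files saved before patching can be converted afterwards instead of rerunning OCR. If you write the readable form to disk yourself, opening the file with `encoding="utf-8"` avoids depending on the Windows default code page.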

greatliu commented 2 months ago