PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

PP-Structure layout recovery merges multiple paragraphs into one #14094

Open Pumpkinhn opened 1 day ago

Pumpkinhn commented 1 day ago

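🐛 Bug

When I use the layout recovery feature, several natural paragraphs are recognized as a single block and then appear as one continuous paragraph in the generated Word file, so the paragraph structure of the original document is lost. Is there a parameter I can tune to fix this, or is this on the development roadmap?

🏃‍♂️ Environment

Windows, Python 3.11, paddleocr==2.8.1, paddlepaddle==2.5.2

🌰 Minimal Reproducible Example

Ordinary documents that are not academic papers.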

GreatV commented 1 day ago

Please provide a detailed description and a sample.

Pumpkinhn commented 1 day ago

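For example, this scanned page: [image: d72a3ca13f0c74eecbccd50402ebc5d] The conversion result is shown here: [image: 图片1] Several short paragraphs were joined into a single paragraph.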

GreatV commented 1 day ago

What command or code did you use for the conversion?

Pumpkinhn commented 1 day ago

I used the PDF-to-Word code from PP-Structure:

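    import os

    import cv2
    import fitz  # PyMuPDF
    from paddleocr import PPStructure, save_structure_res
    from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx

    # pdf_path, temp_images_dir, and save_folder are defined elsewhere in the original script.

    # Initialize the PP-Structure engine with layout recovery enabled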
    table_engine = PPStructure(
        recovery=True,
        use_gpu=False,
        det_model_dir='supports/ppstructure/inference/ch_PP-OCRv4_det_infer',
        # det_model_dir='supports/ppstructure/inference/ch_PP-OCRv4_det_server_infer',
        rec_model_dir='supports/ppstructure/inference/ch_PP-OCRv4_rec_infer',
        # rec_model_dir='supports/ppstructure/inference/ch_PP-OCRv4_rec_server_infer',
        rec_char_dict_path='supports/ppstructure/inference/ppocr_keys_v1.txt',
        table_model_dir='supports/ppstructure/inference/ch_ppstructure_mobile_v2.0_SLANet_infer',
        layout_model_dir='supports/ppstructure/inference/picodet_lcnet_x1_0_fgd_layout_cdla_infer',
        layout_dict_path='supports/ppstructure/inference/layout_cdla_dict.txt'
    )

    doc = fitz.open(pdf_path)

    for page_number in range(doc.page_count):
        # Render the PDF page to an image
        page = doc.load_page(page_number)
        pix = page.get_pixmap()

        page_num_str = str.zfill(str(page_number), 6)
        image_path = os.path.join(temp_images_dir, f"page_{page_num_str}.png")
        pix.save(image_path)

        # Read the image and run layout analysis and OCR
        img = cv2.imread(image_path)
        result = table_engine(img)

        # Save the structured OCR results
        save_structure_res(result, save_folder, f"page_{page_num_str}")

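        # The posted snippet does not define filtered_result; assumed here to be
        # the unmodified engine output.
        filtered_result = result

        # Sort the layout boxes into reading order and convert to DOCX
        h, w, _ = img.shape
        res = sorted_layout_boxes(filtered_result, w)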
        convert_info_docx(img, res, save_folder, f"page_{page_num_str}")
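
One possible workaround, sketched under assumptions rather than taken from PaddleOCR itself: split each recognized text block back into paragraphs before writing it to Word, using the geometry of its OCR lines. The sketch assumes each text region exposes a `res` list whose items carry a `text` string and a four-point `text_region` box, which is how PP-Structure 2.x appears to report text lines. The helper names `line_bounds` and `split_paragraphs` and the two thresholds are hypothetical and would need tuning per document.

```python
def line_bounds(line):
    """Return (left, top, bottom) of one OCR line from its 4-point text_region."""
    xs = [point[0] for point in line["text_region"]]
    ys = [point[1] for point in line["text_region"]]
    return min(xs), min(ys), max(ys)


def split_paragraphs(lines, indent_ratio=1.0, gap_ratio=0.8, joiner=""):
    """Group the OCR lines of one text block into paragraphs.

    A new paragraph starts when a line is indented from the block's left edge
    by more than indent_ratio * line height, or when the vertical gap to the
    previous line exceeds gap_ratio * line height. Both thresholds are
    illustrative guesses, not values used by PaddleOCR.
    """
    if not lines:
        return []
    # Process lines from top to bottom.
    lines = sorted(lines, key=lambda item: line_bounds(item)[1])
    block_left = min(line_bounds(item)[0] for item in lines)

    paragraphs = [[lines[0]["text"]]]
    prev_bottom = line_bounds(lines[0])[2]
    for line in lines[1:]:
        left, top, bottom = line_bounds(line)
        height = bottom - top
        indented = (left - block_left) > indent_ratio * height
        big_gap = (top - prev_bottom) > gap_ratio * height
        if indented or big_gap:
            paragraphs.append([])
        paragraphs[-1].append(line["text"])
        prev_bottom = bottom
    # joiner="" suits Chinese text; use " " for space-separated languages.
    return [joiner.join(parts) for parts in paragraphs]


# Synthetic example: two paragraphs, each starting with an indented line.
sample = [
    {"text": "A1", "text_region": [[60, 10], [300, 10], [300, 30], [60, 30]]},
    {"text": "A2", "text_region": [[20, 35], [300, 35], [300, 55], [20, 55]]},
    {"text": "B1", "text_region": [[60, 60], [300, 60], [300, 80], [60, 80]]},
    {"text": "B2", "text_region": [[20, 85], [300, 85], [300, 105], [20, 105]]},
]
print(split_paragraphs(sample))
```

Each returned string could then be written as its own Word paragraph instead of appending every line of the block to a single one. The heuristic depends on first-line indentation or extra spacing being visible in the scan, so it will not help with documents that use neither.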