PP-Structure layout recovery merges multiple paragraphs into one

Pumpkinhn commented 1 day ago

🔎 Search before asking

[X] I have searched the PaddleOCR Docs and found no similar bug report.
[X] I have searched the PaddleOCR Issues and found no similar bug report.
[X] I have searched the PaddleOCR Discussions and found no similar bug report.

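🐛 Bug (Problem Description)

When I use the layout recovery feature, several natural paragraphs are recognized as a single block and then appear as one continuous paragraph in the generated Word file, so the paragraph structure of the original document is lost. Is there a parameter I can tune to fix this, or is a fix on the development roadmap?

🏃‍♂️ Environment (Runtime Environment)

Windows, Python 3.11, paddleocr==2.8.1, paddlepaddle==2.5.2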

🌰 Minimal Reproducible Example

Ordinary documents that are not academic papers.

GreatV commented 1 day ago

Please provide a detailed description and a sample.

Pumpkinhn commented 1 day ago

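For example, take this scanned page: (image: d72a3ca13f0c74eecbccd50402ebc5d). The conversion result is shown in the image below: several short paragraphs have been joined into one.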

GreatV commented 1 day ago

What command or code did you use for the conversion?

Pumpkinhn commented 1 day ago

I used the PP-Structure PDF-to-Word code:

    import os

    import cv2
    import fitz  # PyMuPDF
    from paddleocr import PPStructure, save_structure_res
    from paddleocr.ppstructure.recovery.recovery_to_doc import (
        convert_info_docx,
        sorted_layout_boxes,
    )

    # pdf_path, temp_images_dir, and save_folder are defined elsewhere.

    # Initialize the OCR engine
    table_engine = PPStructure(
        recovery=True,
        use_gpu=False,
        det_model_dir='supports/ppstructure/inference/ch_PP-OCRv4_det_infer',
        # det_model_dir='supports/ppstructure/inference/ch_PP-OCRv4_det_server_infer',
        rec_model_dir='supports/ppstructure/inference/ch_PP-OCRv4_rec_infer',
        # rec_model_dir='supports/ppstructure/inference/ch_PP-OCRv4_rec_server_infer',
        rec_char_dict_path='supports/ppstructure/inference/ppocr_keys_v1.txt',
        table_model_dir='supports/ppstructure/inference/ch_ppstructure_mobile_v2.0_SLANet_infer',
        layout_model_dir='supports/ppstructure/inference/picodet_lcnet_x1_0_fgd_layout_cdla_infer',
        layout_dict_path='supports/ppstructure/inference/layout_cdla_dict.txt'
    )

    doc = fitz.open(pdf_path)

    for page_number in range(doc.page_count):
        # Render the PDF page to an image
        page = doc.load_page(page_number)
        pix = page.get_pixmap()

        page_num_str = f"{page_number:06d}"
        image_path = os.path.join(temp_images_dir, f"page_{page_num_str}.png")
        pix.save(image_path)

        # Read the image and run OCR
        img = cv2.imread(image_path)
        result = table_engine(img)

        # Save the OCR results
        save_structure_res(result, save_folder, f"page_{page_num_str}")

        # Sort the layout boxes and convert to DOCX.
        # `result` is passed directly; no filtering step is defined in this snippet.
        h, w, _ = img.shape
        res = sorted_layout_boxes(result, w)
        convert_info_docx(img, res, save_folder, f"page_{page_num_str}")
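Until there is a built-in option, one possible workaround is to re-split each merged text region using the line-level boxes that the OCR step already returns. The following is only a rough sketch, not a PaddleOCR feature: `split_paragraphs`, `indent_ratio`, `short_ratio`, and `joiner` are hypothetical names and thresholds, and the sketch assumes that each text region's `res` is a list of dicts with `text` and `text_region` (four `[x, y]` corner points). It starts a new paragraph when a line is indented relative to the region's left edge, or when the previous line ends well short of the right edge.

```python
from typing import Dict, List, Sequence

# One OCR line, assumed shape: {"text": str, "text_region": four [x, y] points}
Line = Dict[str, object]


def split_paragraphs(lines: Sequence[Line],
                     indent_ratio: float = 1.5,
                     short_ratio: float = 0.85,
                     joiner: str = "") -> List[str]:
    """Group the OCR lines of one text region into paragraphs.

    Heuristic only: both ratios are hypothetical defaults and will need
    tuning per document. Use joiner=" " for space-separated languages.
    """
    if not lines:
        return []

    # Reduce each quadrilateral to (left, right, top, bottom, text),
    # then order the lines from top to bottom.
    boxes = []
    for line in lines:
        xs = [p[0] for p in line["text_region"]]
        ys = [p[1] for p in line["text_region"]]
        boxes.append((min(xs), max(xs), min(ys), max(ys), line["text"]))
    boxes.sort(key=lambda b: b[2])

    left = min(b[0] for b in boxes)
    right = max(b[1] for b in boxes)
    width = right - left

    # The median line height serves as a rough character-size unit.
    heights = sorted(b[3] - b[2] for b in boxes)
    line_height = heights[len(heights) // 2]

    paragraphs = [[boxes[0][4]]]
    for prev, cur in zip(boxes, boxes[1:]):
        indented = cur[0] - left >= indent_ratio * line_height
        prev_short = prev[1] - left < short_ratio * width
        if indented or prev_short:
            paragraphs.append([cur[4]])    # start a new paragraph
        else:
            paragraphs[-1].append(cur[4])  # continue the current one
    return [joiner.join(p) for p in paragraphs]
```

Under the assumptions above, this could be called as `split_paragraphs(region["res"])` for every `region` in `result` whose `type` is `"text"`. Note that `convert_info_docx` would not pick up the split automatically; the resulting paragraphs would have to be written to the document separately.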

PaddlePaddle / PaddleOCR

PP-Structure layout recovery merges multiple paragraphs into one #14094
