PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

ppocr ERROR: error in layout recovery image: ./demo/pdf\陕西省“十三五”生态环境保护规划.pdf, err msg: requested span not rectangular #14099

Open 420xincheng opened 1 week ago

420xincheng commented 1 week ago


🐛 Bug (problem description)

Layout recovery on a PDF fails with "error msg: requested span not rectangular". I tested the PDF file with the following command line:

python ./ppstructure/predict_system.py --image_dir="./demo/pdf" --recovery=True --layout_dict_path="./ppocr/utils/dict/layout_dict/layout_cdla_dict.txt" --layout_model_dir="./inference_model/picodet_lcnet_x1_0_fgd_layout_cdla_infer" --det_model_dir="./inference_model/ch_PP-OCRv3_det_infer" --rec_model_dir="./inference_model/ch_PP-OCRv3_rec_infer" --table_model_dir="./inference_model/ch_ppstructure_mobile_v2.0_SLANet_infer" --table_char_dict_path="./ppocr/utils/dict/table_structure_dict_ch.txt"

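The res output and the visualization images are saved correctly; the error occurs when the layout recovery result is saved as a docx file.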

[2024/10/25 19:22:08] ppocr DEBUG: dt_boxes num : 21, elapsed : 0.026086091995239258
[2024/10/25 19:22:08] ppocr DEBUG: rec_res num  : 21, elapsed : 0.1400156021118164
[2024/10/25 19:22:08] ppocr DEBUG: dt_boxes num : 1, elapsed : 0.023510456085205078
[2024/10/25 19:22:08] ppocr DEBUG: rec_res num  : 1, elapsed : 0.004541158676147461
[2024/10/25 19:22:08] ppocr INFO: result save to ./output\structure\xx\show_31.jpg

🏃‍♂️ Environment (runtime environment)

Windows 11
Python 3.8.18
paddleocr-release-2.7

🌰 Minimal Reproducible Example (minimal demo that reproduces the problem)

python ./ppstructure/predict_system.py --image_dir="./demo/pdf" --recovery=True --layout_dict_path="./ppocr/utils/dict/layout_dict/layout_cdla_dict.txt" --layout_model_dir="./inference_model/picodet_lcnet_x1_0_fgd_layout_cdla_infer" --det_model_dir="./inference_model/ch_PP-OCRv3_det_infer" --rec_model_dir="./inference_model/ch_PP-OCRv3_rec_infer" --table_model_dir="./inference_model/ch_ppstructure_mobile_v2.0_SLANet_infer" --table_char_dict_path="./ppocr/utils/dict/table_structure_dict_ch.txt"

freeng commented 5 days ago

Same problem. I did some preliminary debugging; here is the traceback:

[2024/10/30 16:00:52] ppocr ERROR: Traceback: Traceback (most recent call last):
  File "D:\work\ai\PaddleOCR\ppstructure\predict_system.py", line 390, in main
    convert_info_docx(img, all_res, save_folder, img_name)
  File "D:\work\ai\PaddleOCR\ppstructure\recovery\recovery_to_doc.py", line 69, in convert_info_docx
    parser.handle_table(region["res"]["html"], doc)
  File "D:\work\ai\PaddleOCR\ppstructure\recovery\table_process.py", line 276, in handle_table
    cell_to_merge = table.cell(
                    ^^^^^^^^^^^
  File "C:\Users\HP.pyenv\pyenv-win\versions\3.11.9\Lib\site-packages\docx\table.py", line 91, in cell
    return self._cells[cell_idx]
IndexError: list index out of range
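
To illustrate why this lookup can fail, here is a minimal sketch. It assumes the table keeps its cells in one flat, row-major list and computes the index as col_idx + row_idx * column_count, which is what the `self._cells[cell_idx]` line in the traceback suggests. `FlatGrid` is a hypothetical stand-in written for this example, not the real python-docx class.

```python
# Hypothetical stand-in for a table backed by a flat, row-major cell list.
class FlatGrid:
    def __init__(self, num_rows, num_cols):
        self.num_rows = num_rows
        self.num_cols = num_cols
        # One entry per grid position, stored row by row.
        self._cells = [(r, c) for r in range(num_rows) for c in range(num_cols)]

    def cell(self, row_idx, col_idx):
        # Assumed lookup: flat index into the row-major list.
        cell_idx = col_idx + row_idx * self.num_cols
        return self._cells[cell_idx]


grid = FlatGrid(num_rows=2, num_cols=3)

# A column index past the end of a non-final row silently wraps to the next row.
wrapped = grid.cell(0, 3)

# The same overflow on the final row runs off the list and raises IndexError.
try:
    grid.cell(1, 3)
    overflow_error = None
except IndexError as exc:
    overflow_error = str(exc)
```

Under that assumption, an out-of-range column only raises on the last row; on earlier rows it would quietly address a cell in the following row, which may be why the failure only shows up for some tables.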

I modified the code as follows, based on an AI suggestion:
```

def handle_table(self, html, doc):
    table_soup = BeautifulSoup(html, "html.parser")
    rows, cols_len = get_table_dimensions(table_soup)
    table = doc.add_table(len(rows), cols_len)
    table.style = doc.styles["Table Grid"]

    num_rows = len(table.rows)
    num_cols = len(table.columns)

    cell_row = 0
    for index, row in enumerate(rows):
        cols = get_table_columns(row)
        cell_col = 0
        for col in cols:
            colspan = int(col.attrs.get("colspan", 1))
            rowspan = int(col.attrs.get("rowspan", 1))

            cell_html = get_cell_html(col)
            if col.name == "th":
                cell_html = "<b>%s</b>" % cell_html

            if cell_row >= num_rows or cell_col >= num_cols:
                continue

            docx_cell = table.cell(cell_row, cell_col)

            while cell_col < num_cols and docx_cell.text != "":  # Skip the merged cell
                cell_col += 1
                if cell_col < num_cols:  # Make sure the index is valid
                    docx_cell = table.cell(cell_row, cell_col)

            # Make sure the merge-target cell index is valid
            if (cell_row + rowspan - 1 < num_rows) and (cell_col + colspan - 1 < num_cols):
                cell_to_merge = table.cell(cell_row + rowspan - 1, cell_col + colspan - 1)
                if docx_cell != cell_to_merge:
                    docx_cell.merge(cell_to_merge)

            child_parser = HtmlToDocx()
            child_parser.copy_settings_from(self)
            child_parser.add_html_to_cell(cell_html or " ", docx_cell)

            cell_col += colspan
        cell_row += 1
```

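AI analysis
1. Check how the row and column counts are computed

Make sure the row and column counts returned by get_table_dimensions(table_soup) are valid. If rows is empty or cols_len is wrong, later code may hit index errors.
2. Make sure the column index is valid

Before calling table.cell(cell_row, cell_col), make sure cell_row and cell_col are within range. You can add a check before the access: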

if cell_row < num_rows and cell_col < num_cols:
    docx_cell = table.cell(cell_row, cell_col)
else:
    continue  # or other handling logic

3. Validity of the merged cell

Before merging cells, make sure the index of the target cell cell_to_merge is within range. For example:

if (cell_row + rowspan - 1 < num_rows) and (cell_col + colspan - 1 < num_cols):
    cell_to_merge = table.cell(cell_row + rowspan - 1, cell_col + colspan - 1)
    if docx_cell != cell_to_merge:
        docx_cell.merge(cell_to_merge)
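
The bounds checks in points 2 and 3 can be pulled into one small helper. This is only a sketch: `merge_target` is a hypothetical name introduced here, and it works on plain integers rather than on a real docx table.

```python
from typing import Optional, Tuple


def merge_target(cell_row: int, cell_col: int, rowspan: int, colspan: int,
                 num_rows: int, num_cols: int) -> Optional[Tuple[int, int]]:
    """Return the bottom-right (row, col) of the span, or None if it leaves the table."""
    last_row = cell_row + rowspan - 1
    last_col = cell_col + colspan - 1
    # Both the starting cell and the far corner of the span must be inside the grid.
    if 0 <= cell_row <= last_row < num_rows and 0 <= cell_col <= last_col < num_cols:
        return last_row, last_col
    return None


# 3x4 table: a 2x2 span starting at (1, 2) fits, one starting at (2, 3) does not.
fits = merge_target(1, 2, rowspan=2, colspan=2, num_rows=3, num_cols=4)
too_big = merge_target(2, 3, rowspan=2, colspan=2, num_rows=3, num_cols=4)
```

A caller would merge only when the helper returns coordinates and skip the merge when it returns None, which matches the guard in the modified handle_table above.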

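After this change, it runs.

I have not yet learned how to debug Python, so I have not analyzed this in depth. This is for reference only.

420xincheng commented 1 day ago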

I've fixed the problem. You just need to replace the handle_table method in ppstructure/recovery/table_process.py with the following code.

def handle_table(self, html, doc):
        """
        To handle nested tables, we will parse tables manually as follows:
        Get table soup
        Create docx table
        Iterate over soup and fill docx table with new instances of this parser
        Tell HTMLParser to ignore any tags until the corresponding closing table tag
        """
        table_soup = BeautifulSoup(html, "html.parser")
        rows, cols_len = get_table_dimensions(table_soup)
        table = doc.add_table(len(rows), cols_len)
        table.style = doc.styles["Table Grid"]

        num_rows = len(table.rows)
        num_cols = len(table.columns)

        cell_row = 0
        for index, row in enumerate(rows):
            cols = get_table_columns(row)
            cell_col = 0
            for col in cols:
                colspan = int(col.attrs.get("colspan", 1))
                rowspan = int(col.attrs.get("rowspan", 1))

                cell_html = get_cell_html(col)
                if col.name == "th":
                    cell_html = "<b>%s</b>" % cell_html

                if cell_row >= num_rows or cell_col >= num_cols:
                    continue

                docx_cell = table.cell(cell_row, cell_col)

                while docx_cell.text != "":  # Skip the merged cell
                    cell_col += 1
                    docx_cell = table.cell(cell_row, cell_col)

                cell_to_merge = table.cell(
                    cell_row + rowspan - 1, cell_col + colspan - 1
                )
                if docx_cell != cell_to_merge:
                    docx_cell.merge(cell_to_merge)

                child_parser = HtmlToDocx()
                child_parser.copy_settings_from(self)
                child_parser.add_html_to_cell(cell_html or " ", docx_cell)

                cell_col += colspan
            cell_row += 1

The difference is:

while docx_cell.text != "":  # Skip the merged cell
    cell_col += 1
    docx_cell = table.cell(cell_row, cell_col)

cell_to_merge = table.cell(
    cell_row + rowspan - 1, cell_col + colspan - 1
)
if docx_cell != cell_to_merge:
    docx_cell.merge(cell_to_merge)

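Note: I am using paddleocr-release-2.7; the latest version of PaddleOCR has already fixed this problem. The cause: when writing to a cell, the code did not skip cells that had already been merged.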
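
The idea of skipping already-merged cells can be sketched without python-docx. The example below is an illustration under assumptions, not the PaddleOCR code: `place_cells` is a hypothetical helper, and it tracks covered grid positions in a set instead of testing `docx_cell.text` as the method above does.

```python
def place_cells(rows, num_cols):
    """Map each (rowspan, colspan) cell to its top-left grid position.

    `rows` is a list of rows; each row is a list of (rowspan, colspan) tuples
    in document order, as they would appear in HTML <tr>/<td> markup.
    """
    occupied = set()   # grid positions already covered by some cell
    placements = []    # (row, col, rowspan, colspan) per input cell
    for r, row in enumerate(rows):
        c = 0
        for rowspan, colspan in row:
            # Skip positions already covered by a span from an earlier cell.
            while (r, c) in occupied:
                c += 1
            if c + colspan > num_cols:
                raise ValueError(f"cell at row {r} overflows {num_cols} columns")
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            placements.append((r, c, rowspan, colspan))
            c += colspan
    return placements


# Row 0: a cell spanning two rows, then two plain cells.
# Row 1: only two cells in the markup, because column 0 is already covered.
rows = [
    [(2, 1), (1, 1), (1, 1)],
    [(1, 1), (1, 1)],
]
placements = place_cells(rows, num_cols=3)
```

Without the skip loop, the first cell of row 1 would be placed at column 0, on top of the cell that spans down from row 0, and every later cell in that row would be shifted one column to the left.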