Open 420xincheng opened 1 week ago
Same problem. I did some preliminary debugging; here is the traceback:
[2024/10/30 16:00:52] ppocr ERROR: Traceback: Traceback (most recent call last):
  File "D:\work\ai\PaddleOCR\ppstructure\predict_system.py", line 390, in main
    convert_info_docx(img, all_res, save_folder, img_name)
  File "D:\work\ai\PaddleOCR\ppstructure\recovery\recovery_to_doc.py", line 69, in convert_info_docx
    parser.handle_table(region["res"]["html"], doc)
  File "D:\work\ai\PaddleOCR\ppstructure\recovery\table_process.py", line 276, in handle_table
    cell_to_merge = table.cell(
                    ^^^^^^^^^^^
  File "C:\Users\HP.pyenv\pyenv-win\versions\3.11.9\Lib\site-packages\docx\table.py", line 91, in cell
    return self._cells[cell_idx]
IndexError: list index out of range
I modified the code following an AI suggestion, as follows:
```
def handle_table(self, html, doc):
    table_soup = BeautifulSoup(html, "html.parser")
    rows, cols_len = get_table_dimensions(table_soup)
    table = doc.add_table(len(rows), cols_len)
    table.style = doc.styles["Table Grid"]
    num_rows = len(table.rows)
    num_cols = len(table.columns)
    cell_row = 0
    for index, row in enumerate(rows):
        cols = get_table_columns(row)
        cell_col = 0
        for col in cols:
            colspan = int(col.attrs.get("colspan", 1))
            rowspan = int(col.attrs.get("rowspan", 1))
            cell_html = get_cell_html(col)
            if col.name == "th":
                cell_html = "<b>%s</b>" % cell_html
            if cell_row >= num_rows or cell_col >= num_cols:
                continue
            docx_cell = table.cell(cell_row, cell_col)
            while cell_col < num_cols and docx_cell.text != "":  # Skip the merged cell
                cell_col += 1
                if cell_col < num_cols:  # Make sure the index is valid
                    docx_cell = table.cell(cell_row, cell_col)
            # Make sure the index of the cell to merge with is valid
            if (cell_row + rowspan - 1 < num_rows) and (cell_col + colspan - 1 < num_cols):
                cell_to_merge = table.cell(cell_row + rowspan - 1, cell_col + colspan - 1)
                if docx_cell != cell_to_merge:
                    docx_cell.merge(cell_to_merge)
            child_parser = HtmlToDocx()
            child_parser.copy_settings_from(self)
            child_parser.add_html_to_cell(cell_html or " ", docx_cell)
            cell_col += colspan
        cell_row += 1
```
AI analysis:
1. Check how the row and column counts are computed
Make sure the row and column counts returned by get_table_dimensions(table_soup) are valid. If rows is empty or cols_len is wrong, the indexing in the later code can fail.
2. Make sure the column index is valid
Before calling table.cell(cell_row, cell_col), make sure cell_row and cell_col are in range. You can add a check before the access:
    if cell_row < num_rows and cell_col < num_cols:
        docx_cell = table.cell(cell_row, cell_col)
    else:
        continue  # or other handling logic
3. Validate the cells to be merged
Before merging, make sure the index of the target cell cell_to_merge is in range. For example:
    if (cell_row + rowspan - 1 < num_rows) and (cell_col + colspan - 1 < num_cols):
        cell_to_merge = table.cell(cell_row + rowspan - 1, cell_col + colspan - 1)
        if docx_cell != cell_to_merge:
            docx_cell.merge(cell_to_merge)
After this change, it runs.
I haven't learned how to debug Python yet, so I did not analyze this in depth; for reference only.
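The range checks from points 2 and 3 of the AI analysis above can be folded into one small helper. This is only a minimal sketch, not code from PaddleOCR or python-docx: the name `merge_target` and its signature are hypothetical, and it assumes the same `cell_row`, `cell_col`, `rowspan`, `colspan`, `num_rows`, `num_cols` values used in `handle_table`.

```python
def merge_target(cell_row, cell_col, rowspan, colspan, num_rows, num_cols):
    """Return (row, col) of the bottom-right cell of a span, or None if out of range.

    Hypothetical helper: it only does index arithmetic and touches no docx objects.
    """
    # The top-left cell itself must be inside the table.
    if not (0 <= cell_row < num_rows and 0 <= cell_col < num_cols):
        return None
    end_row = cell_row + rowspan - 1
    end_col = cell_col + colspan - 1
    # The bottom-right corner of the span must also be inside the table.
    if end_row >= num_rows or end_col >= num_cols:
        return None
    return end_row, end_col


# A 2x2 span starting at (0, 0) fits in a 3x3 table.
print(merge_target(0, 0, 2, 2, 3, 3))
# A rowspan of 2 starting on the last row does not fit.
print(merge_target(2, 2, 2, 1, 3, 3))
```

Returning None lets the caller skip the merge instead of raising IndexError. Note that skipping silently drops the span, so the resulting docx table may differ from the HTML layout; clamping the span to the table edge would be another option.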
I've fixed the problem. You just need to update the handle_table method in ppstructure/recovery/table_process.py to the following code:
def handle_table(self, html, doc):
    """
    To handle nested tables, we will parse tables manually as follows:
    Get table soup
    Create docx table
    Iterate over soup and fill docx table with new instances of this parser
    Tell HTMLParser to ignore any tags until the corresponding closing table tag
    """
    table_soup = BeautifulSoup(html, "html.parser")
    rows, cols_len = get_table_dimensions(table_soup)
    table = doc.add_table(len(rows), cols_len)
    table.style = doc.styles["Table Grid"]
    num_rows = len(table.rows)
    num_cols = len(table.columns)
    cell_row = 0
    for index, row in enumerate(rows):
        cols = get_table_columns(row)
        cell_col = 0
        for col in cols:
            colspan = int(col.attrs.get("colspan", 1))
            rowspan = int(col.attrs.get("rowspan", 1))
            cell_html = get_cell_html(col)
            if col.name == "th":
                cell_html = "<b>%s</b>" % cell_html
            if cell_row >= num_rows or cell_col >= num_cols:
                continue
            docx_cell = table.cell(cell_row, cell_col)
            while docx_cell.text != "":  # Skip the merged cell
                cell_col += 1
                docx_cell = table.cell(cell_row, cell_col)
            cell_to_merge = table.cell(
                cell_row + rowspan - 1, cell_col + colspan - 1
            )
            if docx_cell != cell_to_merge:
                docx_cell.merge(cell_to_merge)
            child_parser = HtmlToDocx()
            child_parser.copy_settings_from(self)
            child_parser.add_html_to_cell(cell_html or " ", docx_cell)
            cell_col += colspan
        cell_row += 1
The difference is:
while docx_cell.text != "":  # Skip the merged cell
    cell_col += 1
    docx_cell = table.cell(cell_row, cell_col)
cell_to_merge = table.cell(
    cell_row + rowspan - 1, cell_col + colspan - 1
)
if docx_cell != cell_to_merge:
    docx_cell.merge(cell_to_merge)
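Note: I am using paddleocr-release-2.7; the latest version of PaddleOCR has already fixed this issue. The cause of the problem: when writing to a cell, the code did not skip cells that had already been merged.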
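To illustrate the "skip already-merged cells" idea on its own, here is a minimal standalone sketch that tracks occupied grid positions explicitly instead of checking `docx_cell.text`. It is not PaddleOCR's implementation: the function `place_cells` and its input format (each row as a list of `(rowspan, colspan)` tuples in HTML order) are assumptions made for this example, and it does not use python-docx at all.

```python
def place_cells(rows, num_cols):
    """Assign each HTML cell a (row, col) position in a fixed-size grid.

    rows: list of rows; each row is a list of (rowspan, colspan) tuples.
    Returns a list of (row, col, rowspan, colspan) placements.
    Hypothetical helper for illustration only.
    """
    num_rows = len(rows)
    occupied = set()  # grid positions already covered by an earlier span
    placements = []
    for r, row in enumerate(rows):
        c = 0
        for rowspan, colspan in row:
            # Skip positions covered by a span that started earlier.
            while c < num_cols and (r, c) in occupied:
                c += 1
            if c >= num_cols:
                break  # malformed row: no free column left
            # Clip the span so it never leaves the grid.
            rowspan = min(rowspan, num_rows - r)
            colspan = min(colspan, num_cols - c)
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            placements.append((r, c, rowspan, colspan))
            c += colspan
    return placements


# First cell spans two rows, so the only cell of row 1 lands in column 1.
print(place_cells([[(2, 1), (1, 1)], [(1, 1)]], num_cols=2))
# → [(0, 0, 2, 1), (0, 1, 1, 1), (1, 1, 1, 1)]
```

With placements computed this way, every index passed to `table.cell(...)` would already be in range, so neither the skip loop nor the merge target can run past the last column.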
🔎 Search before asking
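🐛 Bug (Problem description)
When running layout recovery on a PDF, it fails with: error msg: requested span not rectangular. I tested the PDF file with the following command line:
The res output and the visualization images are saved correctly; the error occurs when the layout recovery result is saved as docx.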
🏃‍♂️ Environment
🌰 Minimal Reproducible Example
python ./ppstructure/predict_system.py --image_dir="./demo/pdf" --recovery=True --layout_dict_path="./ppocr/utils/dict/layout_dict/layout_cdla_dict.txt" --layout_model_dir="./inference_model/picodet_lcnet_x1_0_fgd_layout_cdla_infer" --det_model_dir="./inference_model/ch_PP-OCRv3_det_infer" --rec_model_dir="./inference_model/ch_PP-OCRv3_rec_infer" --table_model_dir="./inference_model/ch_ppstructure_mobile_v2.0_SLANet_infer" --table_char_dict_path="./ppocr/utils/dict/table_structure_dict_ch.txt"