Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.54k stars, 595 forks

bug/docx parse table without row.grid_cols_before or row.grid_cols_after #3145

Closed antfin closed 1 month ago

antfin commented 1 month ago

Describe the bug: Parsing a 5G 3GPP spec fails (e.g. https://www.3gpp.org/ftp/Specs/archive/23_series/23.503/23503-i50.zip from https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=3334).

To Reproduce: Try to parse the document with partition_docx:

from unstructured.partition.docx import partition_docx

file_path = '23503-i50.docx'
elements = partition_docx(file_path)

Expected behavior: The document is parsed without error.

Screenshots: Parsing raises this exception: AttributeError: '_Row' object has no attribute 'grid_cols_before'

Environment Info: I'm using Python 3.11.8 on my Mac.

Additional context: If I comment out the lines in unstructured/partition/docx.py that use row.grid_cols_before and row.grid_cols_after, parsing works, so it seems that certain rows don't have these attributes. Is it possible to add a check and run the for loops only if the attributes exist?

def iter_row_cells_as_text(row: _Row) -> Iterator[str]:
    """Generate the text of each cell in `row` as a separate string.

    The text of each paragraph within a cell is separated from the next by a newline
    (`"\n"`). A table nested in a cell is first converted to HTML and then included as a
    string, also separated by a newline.
    """
    # -- each omitted cell at the start of the row (pretty rare) gets the empty string --
    # for _ in range(row.grid_cols_before):
    #     yield ""

    for cell in row.cells:
        yield "\n".join(iter_cell_block_items(cell))

    # -- each omitted cell at the end of the row (also rare) gets the empty string --
    # for _ in range(row.grid_cols_after):
    #     yield ""
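
The check asked about above could look roughly like the following. This is only a sketch of the idea, not the code in unstructured: the function name iter_row_cells_as_text_guarded and the cell_text parameter are made up for illustration, and the rows are SimpleNamespace stand-ins so the snippet runs without python-docx. It assumes a missing attribute can be treated as zero omitted cells.

```python
from types import SimpleNamespace
from typing import Callable, Iterator


def iter_row_cells_as_text_guarded(
    row, cell_text: Callable[[object], str]
) -> Iterator[str]:
    """Yield one string per grid column of `row`, tolerating missing attributes.

    Hypothetical helper: `cell_text` stands in for whatever converts a cell to text.
    """
    # -- getattr() falls back to 0 when the row has no `grid_cols_before` --
    for _ in range(getattr(row, "grid_cols_before", 0)):
        yield ""

    for cell in row.cells:
        yield cell_text(cell)

    # -- same fallback for omitted cells at the end of the row --
    for _ in range(getattr(row, "grid_cols_after", 0)):
        yield ""


# -- stand-in rows: one without the attributes, one with them --
old_row = SimpleNamespace(cells=["a", "b"])
new_row = SimpleNamespace(cells=["a", "b"], grid_cols_before=1, grid_cols_after=2)

print(list(iter_row_cells_as_text_guarded(old_row, str)))  # ['a', 'b']
print(list(iter_row_cells_as_text_guarded(new_row, str)))  # ['', 'a', 'b', '', '']
```

One caveat with this kind of guard: when the attributes are missing, rows that really do omit cells lose their padding, so table columns can end up misaligned without any error being raised.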

scanny commented 1 month ago

@antfin make sure you have the latest python-docx package installed:

pip install -U python-docx

The .grid_cols_before attribute was added in the latest release of python-docx (v1.1.2). That dependency is resolved correctly on a fresh install, but initially it wasn't upgraded when updating with pip install -U unstructured[docx]. That was fixed a couple of days ago, but I'm not sure the fix has been released yet.
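
To confirm which python-docx version is actually installed in the environment, something like the sketch below may help. It uses only the standard library; the helper names (parse_version, meets_minimum, installed_python_docx_version) are made up for illustration, and the version parsing is deliberately simplistic: it reads the first three numbers and ignores pre-release suffixes.

```python
import re
from importlib import metadata

# Minimum python-docx version that has `.grid_cols_before`, per the comment above.
MINIMUM = (1, 1, 2)


def parse_version(text: str) -> tuple:
    """Turn a version string such as "1.1.2" into a 3-tuple of ints."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    numbers += [0] * (3 - len(numbers))  # pad "1.2" to (1, 2, 0)
    return tuple(numbers)


def meets_minimum(installed: str, minimum: tuple = MINIMUM) -> bool:
    """Return True if `installed` is at least `minimum`."""
    return parse_version(installed) >= minimum


def installed_python_docx_version():
    """Return the installed python-docx version string, or None if absent."""
    try:
        return metadata.version("python-docx")
    except metadata.PackageNotFoundError:
        return None


found = installed_python_docx_version()
if found is None:
    print("python-docx is not installed")
elif meets_minimum(found):
    print(f"python-docx {found} is new enough")
else:
    print(f"python-docx {found} is too old; run: pip install -U python-docx")
```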

Closing as assumed fixed but don't hesitate to reopen if it's still giving you trouble after updating :)