Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug - duplicates merged cell text following issue #2106 #3250

Open veredmm opened 1 week ago

veredmm commented 1 week ago

I'm still seeing the duplicated-text problem with this kind of table structure:

merged_table2.docx

Table in the document:

image

After partition_docx:

image

Versions: python-docx 1.1.2, unstructured 0.14.3

scanny commented 1 week ago

@veredmm I'm getting "HEADER 5 4 3 2 1 AAA BBB CCC" as elements[0].text for that document, which is the expected behavior and does not repeat the text in that merged cell.

The .metadata.text_as_html for that Table element is this uniform 3 row x 8 col table:

  <table>
    <thead>
      <tr>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
        <th>HEADER</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>5</td>
        <td>4</td>
        <td>4</td>
        <td>3</td>
        <td>2</td>
        <td>2</td>
        <td>1</td>
        <td>1</td>
      </tr>
      <tr>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
        <td>AAA\nBBB\nCCC</td>
      </tr>
    </tbody>
  </table>
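
A quick way to confirm the "uniform" shape is to count the cells in each row. The sketch below uses only Python's standard-library html.parser. The CellCounter class, the row_lengths helper, and the abbreviated sample string are illustrative assumptions and are not part of unstructured.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Count <th>/<td> cells per <tr> (hypothetical helper, not an unstructured API)."""

    def __init__(self):
        super().__init__()
        self.row_lengths = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_lengths.append(0)  # start counting a new row
        elif tag in ("th", "td") and self.row_lengths:
            self.row_lengths[-1] += 1  # one more cell in the current row


def row_lengths(html):
    """Return the number of cells found in each table row."""
    counter = CellCounter()
    counter.feed(html)
    return counter.row_lengths


# Stand-in for the table above: same shape and cell values, built programmatically.
header_row = "<tr>" + "<th>HEADER</th>" * 8 + "</tr>"
number_row = "<tr>" + "".join(f"<td>{v}</td>" for v in "54432211") + "</tr>"
text_row = "<tr>" + "<td>AAA\nBBB\nCCC</td>" * 8 + "</tr>"
sample = (
    "<table><thead>" + header_row + "</thead>"
    + "<tbody>" + number_row + text_row + "</tbody></table>"
)

lengths = row_lengths(sample)
is_uniform = len(set(lengths)) == 1  # every row has the same cell count
```

In practice the string passed to row_lengths would be the Table element's .metadata.text_as_html value.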

The HTML table in .text_as_html is purposely made "uniform" (same number of cells in each row), which is why the same content appears in each "grid" cell of a merged cell.
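
To make the "uniform" behavior concrete, here is a minimal sketch of how merged cells could be expanded into grid cells. It is not unstructured's actual code. The (text, span) row representation, the function names, and the merge layout are all assumptions inferred from the HTML output above.

```python
# Each row is a list of (text, grid_span) pairs, where grid_span is the number
# of grid columns the (possibly merged) cell covers.
# ASSUMPTION: this merge layout is inferred from the HTML output, not read
# from the original .docx file.
ROWS = [
    [("HEADER", 8)],
    [("5", 1), ("4", 2), ("3", 1), ("2", 2), ("1", 2)],
    [("AAA\nBBB\nCCC", 8)],
]


def to_uniform_grid(rows):
    """Repeat each cell's text once per grid column it spans."""
    return [[text for text, span in row for _ in range(span)] for row in rows]


def to_plain_text(rows):
    """Emit each cell's text once, with whitespace runs collapsed to one space."""
    return " ".join(" ".join(text.split()) for row in rows for text, _ in row)


grid = to_uniform_grid(ROWS)   # 3 rows x 8 columns, merged text repeated
text = to_plain_text(ROWS)     # merged text appears only once
```

Under these assumptions, grid reproduces the repeated cells seen in .text_as_html, while text contains each merged cell only once, as in elements[0].text.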

If you think that should look differently, please suggest (in HTML) what you think it should look like instead and we'll consider a change.

veredmm commented 1 week ago

Thanks @scanny. I would suggest that the content of a merged cell appear only in the first cell (td) of the table row, with the other cells left empty.
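
As a rough sketch of that suggestion, the function below writes a merged cell's text into its first grid cell and leaves the remaining grid cells empty, so every row keeps the same number of cells. The (text, span) representation, the function name, and the merge layout are hypothetical and inferred from the output quoted above. This is not how unstructured currently renders tables.

```python
from html import escape


def render_row_first_cell_only(cells, tag="td"):
    """Render one row; a merged cell's text appears only in its first grid cell.

    `cells` is a list of (text, grid_span) pairs (hypothetical representation).
    """
    parts = []
    for text, span in cells:
        parts.append(f"<{tag}>{escape(text)}</{tag}>")  # first grid cell gets the text
        parts.extend(f"<{tag}></{tag}>" for _ in range(span - 1))  # the rest stay empty
    return "<tr>" + "".join(parts) + "</tr>"


# Second row of the table, with the merge layout assumed from the output above.
row_html = render_row_first_cell_only(
    [("5", 1), ("4", 2), ("3", 1), ("2", 2), ("1", 2)]
)
```

An alternative that preserves the merge information would be a single cell with a colspan attribute, at the cost of rows no longer containing the same number of cells.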