Open veredmm opened 1 week ago
@veredmm I'm getting `"HEADER 5 4 3 2 1 AAA BBB CCC"` as `elements[0].text` for that document, which is the expected behavior and does not repeat the text in that merged cell.
The `.metadata.text_as_html` for that `Table` element is this uniform 3 row x 8 col table:
<table>
<thead>
<tr>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
<th>HEADER</th>
</tr>
</thead>
<tbody>
<tr>
<td>5</td>
<td>4</td>
<td>4</td>
<td>3</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
<td>AAA\nBBB\nCCC</td>
</tr>
</tbody>
</table>
The HTML table in `.text_as_html` is purposely made "uniform" (same number of cells in each row), which is why the same content appears in each "grid" cell of a merged cell.
If you think it should look different, please suggest (in HTML) what you think it should look like instead and we'll consider a change.
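To illustrate what "uniform" means here, the following is a minimal sketch, not the library's actual implementation. It expands each merged cell into one entry per grid column it covers. The function name `expand_row` and the column spans are assumptions, chosen only so the output matches the first two rows of the table above.

```python
from typing import List, Tuple


def expand_row(cells: List[Tuple[str, int]]) -> List[str]:
    """Expand (text, colspan) pairs into one entry per grid column.

    Hypothetical helper: a merged cell spanning N columns contributes
    its text N times, so every row ends up with the same cell count.
    """
    row: List[str] = []
    for text, span in cells:
        row.extend([text] * span)
    return row


# Assumed spans, inferred from the HTML above rather than read from the .docx.
header = expand_row([("HEADER", 8)])
values = expand_row([("5", 1), ("4", 2), ("3", 1), ("2", 2), ("1", 2)])

print(len(header), len(values))  # both rows have 8 grid cells
print(values)
```

Both rows come out eight cells wide, which is why the merged cell's text shows up once per covered column in the HTML even though it appears only once in `.text`.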
Thanks @scanny. I would suggest that the content of a merged cell appear only in the first cell (`td`) of the table row, and that the other cells be empty.
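Until something like that exists in the library, a rough post-processing sketch of the suggestion could look like the following. It is a workaround under stated assumptions, not part of `unstructured`: the helper name `blank_repeated_cells` is hypothetical, it assumes the `text_as_html` string is well-formed enough to parse as XML, and it only handles horizontal repeats. It also cannot tell a merged cell from two genuinely separate neighbours that happen to hold the same text.

```python
import xml.etree.ElementTree as ET


def blank_repeated_cells(table_html: str) -> str:
    """Keep a repeated value only in the first cell of each horizontal run.

    Hypothetical helper: within every <tr>, a cell whose text equals the
    text of the run that precedes it is emptied.
    """
    root = ET.fromstring(table_html)
    for row in root.iter("tr"):
        previous = None
        for cell in row:
            text = cell.text or ""
            if text and text == previous:
                cell.text = ""  # same text as the run to its left: blank it
            else:
                previous = text
    # method="html" writes empty cells as <td></td> rather than <td />
    return ET.tostring(root, encoding="unicode", method="html")


sample = "<table><tr><td>AAA</td><td>AAA</td><td>AAA</td></tr></table>"
print(blank_repeated_cells(sample))
```

For the sample row this leaves `AAA` in the first `td` and empties the other two. Applied to the second row of the table above (5, 4, 4, 3, 2, 2, 1, 1), it would blank the second `4`, the second `2`, and the second `1`.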
Still having this duplicated-text problem with this kind of table structure:
merged_table2.docx
table doc:
after partition_docx:
python-docx 1.1.2, unstructured 0.14.3