DOCX doesn't recognize listitems within textbox

veredmm commented 1 month ago

Describe the bug

DOCX partitioning doesn't recognize list items inside a textbox element of a Word document.

To Reproduce

Partition the attached sample Word file, which contains two kinds of list items. As the screenshot shows, only the "plain" list items are recognized; those within the textbox are missing from the elements list.

list_in_texbox_list-item-missing.docx

Screenshots

veredmm commented 1 month ago

This is the file content:

MthwRobinson commented 1 month ago

@scanny - Any thoughts on this one?

scanny commented 1 month ago

We currently extract run text from inline text-box shapes along with the rest of the text in the paragraph to which the textbox is anchored. This behavior was added in this PR: https://github.com/Unstructured-IO/unstructured/pull/2510

We could potentially do this differently, partitioning both inline and floating text-boxes separately so that each list-item inside them is recognized as its own element.

Background

A run is an inline element (think HTML <span>) within a paragraph. Paragraph text can only appear within a run. The text of a paragraph is the concatenation of the text in each of its runs.
A (DOCX) shape contains one of several possible "graphical" items, including a textbox, but can also be an image, chart, SmartArt, etc.
A textbox shape contains one or more paragraphs. In general each non-empty paragraph in a document gives rise to a single element in the output.
A shape can either be inline or floating. An inline shape is treated like a large character and flows with the text of the paragraph. A floating shape is anchored to a paragraph but can be moved to an arbitrary position and text flows around it.
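
To make the run/paragraph relationship concrete, here is a minimal sketch using only the standard library. The XML fragment is hand-written for illustration and `paragraph_text` is a hypothetical helper, not a function from unstructured or python-docx.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written fragment: one paragraph (w:p) made of three runs (w:r).
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t>The quick </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>brown</w:t></w:r>
  <w:r><w:t xml:space="preserve"> fox</w:t></w:r>
</w:p>
"""


def paragraph_text(p):
    """Concatenate the text of the runs that are direct children of a w:p."""
    return "".join(
        t.text or ""
        for r in p.findall("w:r", NS)
        for t in r.findall("w:t", NS)
    )


p = ET.fromstring(PARAGRAPH_XML)
print(paragraph_text(p))
```

Each run carries its own whitespace, which is why the pieces can be joined with an empty separator.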

The approach taken in the prior PR was to include any text in an inline textbox with the text of the paragraph in which it occurs.

Because this only applies to inline shapes and the example here is floating, the "Aaa.." text does not appear in the partitioning output.
If it were an inline textbox, all the text would appear together in a single element, like text="AaaBbbccc" because this is the concatenation of all the runs in the textbox and the paragraph it occurs in is otherwise empty.
If we wanted to partition textbox shapes more precisely, we would need to add a subpartitioner that considered the paragraphs in the text-box separately, each giving rise to their own element. In this case the paragraphs are identified as list items so the textbox would produce three ListItem elements that would occur immediately after the element containing the other text in the paragraph (empty in this particular case).
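
A rough sketch of what such a subpartitioner might do, under loud assumptions: the XML is simplified (real documents nest several DrawingML wrapper elements between `wp:anchor` and `w:txbxContent`), list items are detected only by the presence of `w:numPr`, and `partition_textbox_paragraphs` plus the `"Text"` label are hypothetical names, not part of unstructured.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
NS = {"w": W, "wp": WP}

# Simplified: an otherwise empty paragraph anchoring a floating textbox
# whose content is three numbered paragraphs.
DOC_XML = f"""
<w:p xmlns:w="{W}" xmlns:wp="{WP}">
  <w:r>
    <w:drawing>
      <wp:anchor>
        <w:txbxContent>
          <w:p><w:pPr><w:numPr/></w:pPr><w:r><w:t>Aaa</w:t></w:r></w:p>
          <w:p><w:pPr><w:numPr/></w:pPr><w:r><w:t>Bbb</w:t></w:r></w:p>
          <w:p><w:pPr><w:numPr/></w:pPr><w:r><w:t>ccc</w:t></w:r></w:p>
        </w:txbxContent>
      </wp:anchor>
    </w:drawing>
  </w:r>
</w:p>
"""


def partition_textbox_paragraphs(anchor_paragraph):
    """Return one (element_type, text) pair per non-empty textbox paragraph."""
    elements = []
    for txbx in anchor_paragraph.iter(f"{{{W}}}txbxContent"):
        for p in txbx.findall("w:p", NS):
            text = "".join(t.text or "" for t in p.findall("w:r/w:t", NS))
            if not text.strip():
                continue  # empty paragraphs produce no element
            # Assumption: numbering properties mark the paragraph as a list item.
            is_list_item = p.find("w:pPr/w:numPr", NS) is not None
            elements.append(("ListItem" if is_list_item else "Text", text))
    return elements


elements = partition_textbox_paragraphs(ET.fromstring(DOC_XML))
print(elements)
```

The point of the sketch is only that each textbox paragraph is visited separately rather than having its runs folded into the anchoring paragraph.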

veredmm commented 1 month ago

@scanny - Any suggestions for workarounds, given that I have many documents with this structure (floating shapes with a lot of text inside)?

scanny commented 1 month ago

@veredmm Not off the top of my head, no. A general-case solution is pretty disruptive to the current partitioner structure (so wouldn't be easy to monkey-patch or whatever) and would require deep domain knowledge of the DOCX format.

That said, if you changed this line: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L441

from:

"w:r | w:hyperlink | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"

to:

"w:r"
" | w:hyperlink"
" | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
" | w:r/descendant::wp:anchor[ancestor::w:drawing][1]//w:r"

(note wp:anchor (floating shape) in addition to wp:inline (inline shape))

Then the text inside the textboxes would at least appear in the output.

It wouldn't be pretty because paragraph text would be joined together without a space in between, like:

the quick brown fox

jumped over the lazy dog

would appear as: "whatever text came beforethe quick brown foxjumped over the lazy dogwhatever text came after"

So you'd have to judge whether the benefit was worth the trouble.
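
The effect of adding wp:anchor can be approximated with the standard library. This is only an emulation: `xml.etree` does not support the XPath union or axis syntax used in docx.py, so the traversal is written by hand, the XML is simplified, and `paragraph_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
NS = {"w": W, "wp": WP}
INLINE = f"{{{WP}}}inline"
ANCHOR = f"{{{WP}}}anchor"

# Simplified paragraph: text, a floating textbox with two paragraphs, more text.
XML = f"""
<w:p xmlns:w="{W}" xmlns:wp="{WP}">
  <w:r><w:t>whatever text came before</w:t></w:r>
  <w:r>
    <w:drawing>
      <wp:anchor>
        <w:txbxContent>
          <w:p><w:r><w:t>the quick brown fox</w:t></w:r></w:p>
          <w:p><w:r><w:t>jumped over the lazy dog</w:t></w:r></w:p>
        </w:txbxContent>
      </wp:anchor>
    </w:drawing>
  </w:r>
  <w:r><w:t>whatever text came after</w:t></w:r>
</w:p>
"""


def run_text(r):
    """Text of the w:t elements that are direct children of a run."""
    return "".join(t.text or "" for t in r.findall("w:t", NS))


def paragraph_text(p, shape_tags):
    """Join run text, descending into shapes whose tag is in shape_tags."""
    parts = []
    for r in p.findall("w:r", NS):
        parts.append(run_text(r))
        for drawing in r.findall("w:drawing", NS):
            for shape in drawing:
                if shape.tag in shape_tags:
                    parts.extend(run_text(x) for x in shape.iter(f"{{{W}}}r"))
    return "".join(parts)


p = ET.fromstring(XML)
print(paragraph_text(p, {INLINE}))          # floating textbox text is skipped
print(paragraph_text(p, {INLINE, ANCHOR}))  # textbox text appears, run together
```

With only inline shapes considered, the textbox text is absent; once anchored shapes are included, it shows up but the paragraph boundaries inside the textbox are lost.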

veredmm commented 1 month ago

@scanny thanks! But why not just add a space in the join statement to keep the words from running together:

text = " ".join(
    e.text
    for e in paragraph._p.xpath(
        "w:r"
        " | w:hyperlink"
        " | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
        " | w:r/descendant::wp:anchor[ancestor::w:drawing][1]//w:r"
    )
)

scanny commented 1 month ago

@veredmm Could do, but that would place an extra space between regular runs, which already contain whatever space they need.
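
A small illustration of the extra-space problem, using plain strings in place of run objects. The second half shows one possible alternative, which nobody in this thread proposed: join runs with no separator within a paragraph and put the separator only between paragraphs. `join_paragraphs` is a hypothetical helper and assumes the runs have already been grouped by paragraph, which the single XPath expression above does not do.

```python
# Regular runs already carry the whitespace they need.
runs = ["The quick ", "brown", " fox"]

joined_plain = "".join(runs)    # spacing preserved as authored
joined_spaced = " ".join(runs)  # a second space appears at each run boundary


def join_paragraphs(paragraph_runs, sep=" "):
    """Join runs tightly within each paragraph, then paragraphs with sep."""
    return sep.join("".join(runs) for runs in paragraph_runs)


textbox = [
    ["the quick ", "brown fox"],
    ["jumped over ", "the lazy dog"],
]
joined_paragraphs = join_paragraphs(textbox)

print(repr(joined_plain))
print(repr(joined_spaced))
print(repr(joined_paragraphs))
```

Grouping by paragraph keeps the spacing inside each paragraph untouched while still separating the paragraphs from one another.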

Unstructured-IO / unstructured

DOCX doesn't recognize listitems within textbox #3103