Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

DOCX doesn't recognize listitems within textbox #3103

Open veredmm opened 1 month ago

veredmm commented 1 month ago

Describe the bug: The DOCX partitioner doesn't recognize list items inside a text-box element of a Word document.

To Reproduce: Partition a sample Word file that contains two kinds of list items. As the screenshot shows, only the "plain" list items are recognized; those within the text box are missing from the elements list.

list_in_texbox_list-item-missing.docx

Screenshots: image

veredmm commented 1 month ago

This is the file content: image

MthwRobinson commented 1 month ago

@scanny - Any thoughts on this one?

scanny commented 1 month ago

We currently extract run text from inline text-box shapes along with the rest of the text in the paragraph to which the textbox is anchored. This behavior was added in this PR: https://github.com/Unstructured-IO/unstructured/pull/2510

We could potentially do this differently, so that both inline and floating text-boxes were partitioned separately; the list-items inside them would then each be recognized as a separate element.

Background

The approach taken in the prior PR was to include any text in an inline textbox with the text of the paragraph in which it occurs.

veredmm commented 1 month ago

@scanny - Any suggestions for workarounds, given that I have many documents with this structure (floating shapes with a lot of text inside)?

scanny commented 1 month ago

@veredmm Not off the top of my head, no. A general-case solution is pretty disruptive to the current partitioner structure (so wouldn't be easy to monkey-patch or whatever) and would require deep domain knowledge of the DOCX format.

That said, if you changed this line: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L441

from:

"w:r | w:hyperlink | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"

to:

"w:r"
" | w:hyperlink"
" | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
" | w:r/descendant::wp:anchor[ancestor::w:drawing][1]//w:r"

(note wp:anchor (floating shape) in addition to wp:inline (inline shape))

Then the text inside the textboxes would at least appear in the output.

It wouldn't be pretty because paragraph text would be joined together without a space in between, like:

  • the quick brown fox
  • jumped over the lazy dog

would appear as: "whatever text came beforethe quick brown foxjumped over the lazy dogwhatever text came after"

So you'd have to judge whether the benefit was worth the trouble.
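
As a rough illustration of the effect described above, the sketch below emulates the two selections with the Python standard library on a hand-written, heavily simplified paragraph. It is not the partitioner's code: `run_text`, `first_shape`, and `selected_runs` are hypothetical helpers, hyperlinks are left out, and real text-box markup has several more wrapper elements between `w:drawing` and the text.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"

# Simplified stand-in for a paragraph that anchors a floating text box.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}" xmlns:wp="{WP}">
  <w:r><w:t>before </w:t></w:r>
  <w:r>
    <w:drawing>
      <wp:anchor>
        <w:txbxContent>
          <w:p><w:r><w:t>the quick brown fox</w:t></w:r></w:p>
          <w:p><w:r><w:t>jumped over the lazy dog</w:t></w:r></w:p>
        </w:txbxContent>
      </wp:anchor>
    </w:drawing>
  </w:r>
  <w:r><w:t> after</w:t></w:r>
</w:p>
"""


def run_text(run):
    """Text of the w:t elements that are direct children of a run."""
    return "".join(t.text or "" for t in run.findall(f"{{{W}}}t"))


def first_shape(run, tag):
    """First wp:<tag> element found under a w:drawing inside this run, if any."""
    for drawing in run.iter(f"{{{W}}}drawing"):
        for shape in drawing.iter(f"{{{WP}}}{tag}"):
            return shape
    return None


def selected_runs(paragraph, shape_tags):
    """Yield each top-level run, then the runs nested in its first matching shape."""
    for run in paragraph.findall(f"{{{W}}}r"):
        yield run
        for tag in shape_tags:
            shape = first_shape(run, tag)
            if shape is not None:
                yield from shape.iter(f"{{{W}}}r")


p = ET.fromstring(PARAGRAPH_XML)
inline_only = "".join(run_text(r) for r in selected_runs(p, ["inline"]))
with_anchor = "".join(run_text(r) for r in selected_runs(p, ["inline", "anchor"]))

print(repr(inline_only))  # text-box content is absent
print(repr(with_anchor))  # text-box content appears, but its paragraphs run together
```

With only `inline` selected, the floating text box contributes nothing; adding `anchor` brings its text in, with the two text-box paragraphs joined without a separator.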

veredmm commented 1 month ago

@scanny thanks! But I wonder, why not just add a space in the join statement to keep the words from running together:

text = " ".join(
    e.text
    for e in paragraph._p.xpath(
        "w:r"
        " | w:hyperlink"
        " | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
        " | w:r/descendant::wp:anchor[ancestor::w:drawing][1]//w:r"
    )
)

scanny commented 1 month ago

@veredmm Could do, but that would place an extra space between regular runs, which already contain whatever space they need.
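
The trade-off can be seen with plain strings standing in for run text. This is a minimal sketch with made-up sample runs, not output from the partitioner; it assumes, as is common in Word files, that regular runs carry their own spacing and that a single word can be split across runs.

```python
# Regular runs usually carry their own leading or trailing spaces.
regular = ["The quick ", "brown", " fox"]

# A single word can also be split across runs (for example by formatting).
split_word = ["un", "structured"]

empty_join_regular = "".join(regular)
space_join_regular = " ".join(regular)
space_join_split = " ".join(split_word)

print(repr(empty_join_regular))  # spacing preserved as authored
print(repr(space_join_regular))  # doubled spaces between runs
print(repr(space_join_split))    # word broken in two
```

Joining with a space fixes the run-together text-box paragraphs at the cost of changing the text of every ordinary paragraph, so whether it is acceptable depends on how sensitive downstream processing is to extra whitespace.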