Open veredmm opened 1 month ago
this is the file content :
@scanny - Any thoughts on this one?
We currently extract run text from inline text-box shapes along with the rest of the text in the paragraph to which the textbox is anchored. This behavior was added in this PR: https://github.com/Unstructured-IO/unstructured/pull/2510
We could potentially do this differently such that both inline and floating text-boxes were separately partitioned, which would recognize list-items inside them each as a separate element.
Background
A run (w:r) is an inline span of text (comparable to an HTML <span>) within a paragraph. Paragraph text can only appear within a run. The text of a paragraph is the concatenation of the text in each of its runs.
The approach taken in the prior PR was to include any text in an inline textbox with the text of the paragraph in which it occurs. The resulting element has text="AaaBbbccc" because this is the concatenation of all the runs in the textbox and the paragraph it occurs in is otherwise empty.
Partitioning the textbox contents separately would instead produce ListItem elements that would occur immediately after the element containing the other text in the paragraph (empty in this particular case).
@scanny - Any suggestions for workarounds in case I have many documents with this structure (floating shapes with a lot of text inside)?
@veredmm Not off the top of my head, no. A general-case solution is pretty disruptive to the current partitioner structure (so wouldn't be easy to monkey-patch or whatever) and would require deep domain knowledge of the DOCX format.
That said, if you changed this line: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L441
from:
"w:r | w:hyperlink | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
to:
"w:r"
" | w:hyperlink"
" | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
" | w:r/descendant::wp:anchor[ancestor::w:drawing][1]//w:r"
(note wp:anchor (floating shape) in addition to wp:inline (inline shape))
Then the text inside the textboxes would at least appear in the output.
It wouldn't be pretty because paragraph text would be joined together without a space in between, like:
- the quick brown fox
- jumped over the lazy dog
would appear as: "whatever text came beforethe quick brown foxjumped over the lazy dogwhatever text came after"
So you'd have to judge whether the benefit was worth the trouble.
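To make the effect of adding wp:anchor concrete, here is a minimal standard-library sketch. It does not call the partitioner or evaluate the real XPath; it only approximates what the two expressions select. The paragraph XML is a hypothetical, heavily simplified stand-in (a real textbox nests several more wrapper elements), and run_text and paragraph_text are illustrative helper names, not part of unstructured or python-docx. Hyperlinks and the [1] predicate are left out for brevity.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"

# Hypothetical, simplified paragraph: one floating (wp:anchor) textbox
# holding two inner paragraphs, surrounded by two ordinary runs.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}" xmlns:wp="{WP}">
  <w:r><w:t>before</w:t></w:r>
  <w:r>
    <w:drawing>
      <wp:anchor>
        <w:txbxContent>
          <w:p><w:r><w:t>the quick brown fox</w:t></w:r></w:p>
          <w:p><w:r><w:t>jumped over the lazy dog</w:t></w:r></w:p>
        </w:txbxContent>
      </wp:anchor>
    </w:drawing>
  </w:r>
  <w:r><w:t>after</w:t></w:r>
</w:p>
"""


def run_text(r):
    """Text held directly by a run (its own w:t children only)."""
    return "".join(t.text or "" for t in r.findall(f"{{{W}}}t"))


def paragraph_text(p, shape_tags):
    """Join the text of top-level runs plus runs nested in the given shape kinds."""
    parts = []
    for r in p.findall(f"{{{W}}}r"):  # direct-child runs, in document order
        parts.append(run_text(r))
        for tag in shape_tags:  # "inline" and/or "anchor"
            for shape in r.iter(f"{{{WP}}}{tag}"):
                parts.extend(run_text(inner) for inner in shape.iter(f"{{{W}}}r"))
    return "".join(parts)


p = ET.fromstring(PARAGRAPH_XML)
inline_only = paragraph_text(p, ["inline"])
with_anchor = paragraph_text(p, ["inline", "anchor"])
print(inline_only)   # beforeafter
print(with_anchor)   # beforethe quick brown foxjumped over the lazy dogafter
```

With only inline shapes selected, the floating textbox text is dropped entirely; with anchor included it shows up, but the inner paragraphs are run together with no separator, as described above.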
@scanny thanks! But I wonder, why not just add a space in the join statement to prevent the words from running together:
text = " ".join(
    e.text
    for e in paragraph._p.xpath(
        "w:r"
        " | w:hyperlink"
        " | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
        " | w:r/descendant::wp:anchor[ancestor::w:drawing][1]//w:r"
    )
)
@veredmm Could do, but that would place an extra space between regular runs, which already contain whatever space they need.
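A small illustration of that trade-off, using made-up run texts rather than anything read from a real document. It assumes, as is common in Word files, that a single sentence (or even a single word) can be split across several runs, each of which already carries its own spacing.

```python
# Hypothetical run texts for one ordinary paragraph. Note the word
# "Partition" is split across two runs, and spaces live inside the runs.
regular_runs = ["Par", "tition the ", "document"]

joined_plain = "".join(regular_runs)    # current behavior
joined_spaced = " ".join(regular_runs)  # proposed " ".join

print(repr(joined_plain))   # 'Partition the document'
print(repr(joined_spaced))  # 'Par tition the  document'
```

So a blanket " ".join fixes the textbox case at the cost of breaking words and doubling spaces in ordinary paragraphs.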
Describe the bug
The DOCX partitioner doesn't recognize list items within a textbox element of a Word document.
To Reproduce
Provide a sample Word file with two kinds of list items. You can see in the screenshot that only the "plain" list items are recognized and those within the textbox are missing from the elements list.
list_in_texbox_list-item-missing.docx
Screenshots
![image](https://github.com/Unstructured-IO/unstructured/assets/169882103/865e3a1d-a72a-4128-907f-47c8ef8510c7)