Unstructured-IO / unstructured-api

Apache License 2.0

Tables are not part of the output #381

Closed leolorenzoluis closed 7 months ago

leolorenzoluis commented 7 months ago

Describe the bug

When using this sample document, the contents of the table are not extracted. I also tested other tables in Word, with no luck. Is only plain text supported?

To Reproduce

whatever.docx

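from unstructured_client.models import shared

# `stream` is an open binary file handle and `file_name` is its name; both are defined elsewhere.
req = shared.PartitionParameters(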
    # Note that this currently only supports a single file
    files=shared.Files(
        content=stream.read(),
        file_name=file_name,
    ),
    # Other partition params
    strategy="auto",
    include_page_breaks="true",
    # chunking_strategy="by_title",
    # max_characters=5000,
    # combine_under_n_chars=1500,
    # xml_keep_tags=True,
    # skip_infer_table_types=False
)

Environment:

scanny commented 7 months ago

@leolorenzoluis, our docx partitioner definitely supports tables; they are a very important element type.

When I partition this file I get three elements: [Table, PageBreak, Title] containing the text:

'Whatever Daga tangina |  | |  | |  | |  |',
'',
'Ewan',

respectively.
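
To check quickly whether a response contains any tables, the returned elements can be tallied by type. The following is a minimal sketch, assuming each element is a dict with "type" and "text" keys as in the API's JSON output. The helper name summarize_elements is hypothetical, and the sample list just mirrors the three elements quoted above rather than a real API response.

```python
from collections import Counter


def summarize_elements(elements):
    """Count elements by type and collect the text of every Table element."""
    type_counts = Counter(el.get("type", "Unknown") for el in elements)
    table_texts = [el.get("text", "") for el in elements if el.get("type") == "Table"]
    return type_counts, table_texts


# Hypothetical data mirroring the three elements reported in this thread.
sample_elements = [
    {"type": "Table", "text": "Whatever Daga tangina |  | |  | |  | |  |"},
    {"type": "PageBreak", "text": ""},
    {"type": "Title", "text": "Ewan"},
]

type_counts, table_texts = summarize_elements(sample_elements)
print(dict(type_counts))
print(table_texts)
```

A Table element that is present but has mostly empty cells, as here, suggests the table was detected and its cell text was not.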

Inspecting the docx contents, I see that much of the content is contained in "structured document tags" (w:sdt elements), more commonly called something like "form fields". Text inside form fields is currently not captured, which explains the behavior for this particular document.
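
One way to check whether a given .docx is affected is to look for w:sdt elements in its main document part. The sketch below uses only the Python standard library and is not how the partitioner itself reads the file. The helper names sdt_texts and make_sample_docx are hypothetical, and the sample archive holds only word/document.xml, so it is enough for this check but is not a complete Word file.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def sdt_texts(docx_bytes):
    """Return the concatenated w:t text found inside each w:sdt element."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for sdt in root.iter(f"{{{W_NS}}}sdt"):
        texts.append("".join(t.text or "" for t in sdt.iter(f"{{{W_NS}}}t")))
    return texts


def make_sample_docx(document_xml):
    """Build an in-memory zip holding only word/document.xml (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


SAMPLE_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Plain text</w:t></w:r></w:p>"
    "<w:sdt><w:sdtContent><w:p><w:r><w:t>Inside a tag</w:t></w:r></w:p>"
    "</w:sdtContent></w:sdt>"
    "</w:body></w:document>"
)

sample_texts = sdt_texts(make_sample_docx(SAMPLE_XML))
print(sample_texts)
```

For a real file, pass the bytes read from disk, for example sdt_texts(open(path, "rb").read()). A non-empty result means some text lives inside these tags and would currently be skipped.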

Form fields are relatively uncommon in the broad corpus of Word documents, but they may be common in certain collections, for example when Word is used to collect information from respondents. If you try this with other .docx files, I expect you'll see their tables extracted as you'd hoped.

If you work with a lot of Word documents with text in form fields, let us know more about your situation and use cases and we can consider adding support for this.

scanny commented 7 months ago

I'm going to close this issue for now @leolorenzoluis just because it is not immediately actionable. But don't hesitate to reopen it if you are willing to describe your use-cases and advocate for us adding this support :)