Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/combineUnderNChars not working properly #3138

Closed leSullivan closed 2 weeks ago

leSullivan commented 1 month ago

Describe the bug

The combineUnderNChars parameter doesn't work as expected for me. For different documents I got, for example:

chunk_len 167
chunk_len 183
chunk_len 1613
chunk_len 529
chunk_len 2111

or

chunk_len 393
chunk_len 477
chunk_len 102
chunk_len 751
chunk_len 750
chunk_len 304
chunk_len 134
chunk_len 398
chunk_len 177
chunk_len 618
chunk_len 13

To Reproduce

client.partition({
    strategy: 'hi_res',
    chunkingStrategy: 'by_title',
    maxCharacters: 10000,
    combineUnderNChars: 1500,
    newAfterNChars: 1500, // (default)
    pdfInferTableStructure: true,
})

This is with the unstructured Docker image v0.0.68 and the "unstructured-client": "^0.10.6" JS library.

Expected behavior

Adjacent chunks whose total length is lower than my combineUnderNChars value should be combined.

scanny commented 1 month ago

Can you characterize the chunks more completely? Something like:

# Print the element type, text length, and HTML length (0 when absent) of each chunk.
for c in chunks:
    print(
        f"{type(c).__name__}"
        f" - text_len={len(c.text)}"
        f" html_len={len(c.metadata.text_as_html) if c.metadata.text_as_html else 0}"
    )
leSullivan commented 1 month ago

sure:

Type: CompositeElement Length: 167 HTML Length undefined

Type: Table Length: 66 HTML Length 183

Type: CompositeElement Length: 1613 HTML Length undefined

Type: Table Length: 602 HTML Length 529

Type: CompositeElement Length: 2111 HTML Length undefined

leSullivan commented 1 month ago

Okay, from what I can tell after further experiments, combineUnderNChars doesn't combine CompositeElement chunks with Table chunks. Is there a specific reason for that?

scanny commented 1 month ago

Yeah, this is the expected behavior. Table elements are not combined with any other elements during chunking.

One characteristic of Table elements is that they have .metadata.text_as_html which preserves the table structure. I'm sure that's at least part of the reason. That metadata would be either lost or at least hard to combine with what a CompositeElement has.
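To illustrate the behavior described here, a minimal sketch follows. It is not the library's chunker: `Chunk` and `combine_small_chunks` are hypothetical names, and the merge rule is a simplification that only assumes a small text chunk may be merged with the following text chunk, while a Table always stays on its own.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk; not the library's element classes."""
    kind: str  # "CompositeElement" or "Table"
    text: str


def combine_small_chunks(chunks, combine_under_n_chars, max_characters):
    """Merge a small text chunk with the next text chunk; never merge Tables."""
    combined = []
    for chunk in chunks:
        prev = combined[-1] if combined else None
        can_merge = (
            prev is not None
            and prev.kind != "Table"
            and chunk.kind != "Table"
            and len(prev.text) < combine_under_n_chars
            # +2 accounts for the blank-line separator used when joining
            and len(prev.text) + 2 + len(chunk.text) <= max_characters
        )
        if can_merge:
            combined[-1] = Chunk(prev.kind, prev.text + "\n\n" + chunk.text)
        else:
            combined.append(chunk)
    return combined


# Same sequence of types and text lengths as reported in this issue.
reported = [
    Chunk("CompositeElement", "a" * 167),
    Chunk("Table", "t" * 66),
    Chunk("CompositeElement", "b" * 1613),
    Chunk("Table", "t" * 602),
    Chunk("CompositeElement", "c" * 2111),
]
result = combine_small_chunks(reported, 1500, 10000)
print([len(c.text) for c in result])
```

Under these assumptions, the 167-character chunk cannot be merged with anything because its only neighbor is a Table, which would explain why small chunks survive in a document where text and tables alternate.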

@MthwRobinson was there more to the original design choice to segregate table elements during chunking?

@leSullivan can you describe the behavior you would be looking for and how it would be preferable for your use-case?

MthwRobinson commented 1 month ago

@scanny - To avoid breaking up the table. Not sure we'd change that behavior in this chunker since it's been out in the wild for a while and I don't think we want a breaking change there, but there could certainly be another chunking method that does split up the tables.

leSullivan commented 2 weeks ago

@scanny sorry for the late reply.

My expected behavior would be for "combineUnderNChars" to work no matter what the chunk type is. Another approach, which I'm currently testing, is to split tables by row if they exceed the maxChunkSize and give each sub-chunk of the table the table header as well. This of course only works with larger maxChunkSizes.
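The row-based splitting idea could be sketched roughly as follows. This is only an illustration under assumptions: `split_table_rows` is a hypothetical helper, the table is assumed to be available as a header string plus a list of row strings, and nothing here is part of unstructured's API.

```python
def split_table_rows(header, rows, max_chars):
    """Group rows into sub-chunks that each start with the table header.

    A sub-chunk is closed before it would exceed max_chars. A single row
    that is too long on its own is still emitted (with the header), so
    max_chars needs to be large enough for the header plus the longest row.
    """
    chunks = []
    current = [header]
    current_len = len(header)
    for row in rows:
        added = len(row) + 1  # +1 for the newline that joins the lines
        # Only flush when the current sub-chunk already holds at least one row.
        if len(current) > 1 and current_len + added > max_chars:
            chunks.append("\n".join(current))
            current = [header]
            current_len = len(header)
        current.append(row)
        current_len += added
    if len(current) > 1:
        chunks.append("\n".join(current))
    return chunks


header = "name | qty"
rows = ["apple | 1", "pear | 22", "plum | 333"]
for part in split_table_rows(header, rows, max_chars=30):
    print(part)
    print("---")
```

Repeating the header costs extra characters in every sub-chunk, which is why the approach needs a maxChunkSize comfortably larger than the header plus a typical row.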

Thanks for your help !