Closed leSullivan closed 2 weeks ago
Can you characterize the chunks more completely? Something like:
for c in chunks:
    # Print element type, text length, and HTML length (0 when no HTML is attached)
    print(
        f"{type(c).__name__}"
        f" - text_len={len(c.text)}"
        f" html_len={len(c.metadata.text_as_html) if c.metadata.text_as_html else 0}"
    )
sure:
Type: CompositeElement Length: 167 HTML Length undefined
Type: Table Length: 66 HTML Length 183
Type: CompositeElement Length: 1613 HTML Length undefined
Type: Table Length: 602 HTML Length 529
Type: CompositeElement Length: 2111 HTML Length undefined
Okay, from what I can tell after further experiments, combineUnderNChars doesn't combine CompositeElements and Tables. Is there a specific reason for that?
Yeah, this is the expected behavior. Table elements are not combined with any other elements during chunking.
One characteristic of Table elements is that they have .metadata.text_as_html, which preserves the table structure. I'm sure that's at least part of the reason. That metadata would be either lost or at least hard to combine with what a CompositeElement has.
@MthwRobinson was there more to the original design choice to segregate table elements during chunking?
@leSullivan can you describe the behavior you would be looking for and how it would be preferable for your use-case?
@scanny - To avoid breaking up the table. Not sure we'd change that behavior in this chunker since it's been out in the wild for a while and I don't think we want a breaking change there, but there could certainly be another chunking method that does split up the tables.
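The behavior described above can be modeled roughly as follows. This is a simplified sketch, not the library's chunking code: the Chunk class, combine_small_chunks, and its two parameters are hypothetical names, and the merge rule (fold the next text chunk into a small preceding one if the result still fits) is an assumption made for illustration. The point is only that a Table on either side blocks the merge.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk; not the unstructured element classes."""
    kind: str                          # "CompositeElement" or "Table"
    text: str
    text_as_html: Optional[str] = None


def combine_small_chunks(
    chunks: List[Chunk], combine_under_n_chars: int, max_characters: int
) -> List[Chunk]:
    """Merge adjacent small text chunks, but never merge across a Table."""
    out: List[Chunk] = []
    for chunk in chunks:
        prev = out[-1] if out else None
        can_merge = (
            prev is not None
            and prev.kind != "Table"       # tables stay isolated
            and chunk.kind != "Table"
            and len(prev.text) < combine_under_n_chars
            # +2 accounts for the blank-line separator added below
            and len(prev.text) + len(chunk.text) + 2 <= max_characters
        )
        if can_merge:
            out[-1] = Chunk(prev.kind, prev.text + "\n\n" + chunk.text)
        else:
            out.append(chunk)
    return out
```

With the lengths reported earlier in the thread (167, a 66-character Table, 1613), this model leaves all three chunks separate, because the Table sits between the two text chunks. Remove the Table and the two text chunks merge into one.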
@scanny sorry for the late reply.
My expected behaviour would be for "combineUnderNChars" to work no matter what the chunk type is. Another approach, which I'm currently testing, is to split tables by row if they exceed maxChunkSize, giving each sub-chunk of the table the table header as well. Of course, this only works with a larger maxChunkSize.
Thanks for your help!
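The row-wise table split mentioned above could look something like this. It is a minimal sketch under stated assumptions: split_table_rows and its parameters are hypothetical names, the table is assumed to be already available as a header string plus a list of row strings, and nothing here is part of unstructured or unstructured-client.

```python
from typing import List


def split_table_rows(header: str, rows: List[str], max_chunk_size: int) -> List[str]:
    """Split a table into sub-chunks by row, repeating the header in each one.

    A row that does not fit next to the header is still emitted on its own,
    so a sub-chunk can exceed max_chunk_size when max_chunk_size is small.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = len(header)
    for row in rows:
        row_len = len(row) + 1  # +1 for the newline that joins the row
        if current and current_len + row_len > max_chunk_size:
            # Flush the rows collected so far, prefixed with the header
            chunks.append("\n".join([header, *current]))
            current, current_len = [], len(header)
        current.append(row)
        current_len += row_len
    if current:
        chunks.append("\n".join([header, *current]))
    return chunks
```

Because the header is repeated in every sub-chunk, its length is paid for each time, which is why this only pays off when the maximum chunk size is comfortably larger than the header plus a typical row.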
Describe the bug
The combineUnderNChars parameter doesn't work as expected for me. For different documents I got, for example:
or
To Reproduce
With the unstructured Docker image v 0.0.68 and the JS lib "unstructured-client": "^0.10.6"
Expected behavior
Chunks next to each other whose total length is lower than my combineUnderNChars should be combined.