Open LucasOliveira44 opened 6 months ago
@LucasOliveira44 , thanks for submitting this issue.
It would be nice to have an option or feature that allows me to control the behavior of chunking when encountering Table elements. Ideally, I would like to be able to specify that a Table element should not break the continuity of text chunks, and instead be included within the same chunk as the preceding text.
As text, html, or other (like markdown)?
I'm not sure I understand the question, but I mean the whole table element. In my case I am only interested in the text, because text_as_html crops the last row of the table, as mentioned in https://github.com/Unstructured-IO/unstructured/issues/2478#issue-2108473946, so I have given up on using text_as_html.
One other possible behavior variation that occurred to me, not sure whether it's useful or not:
When option {x} has value {y}, content within tables is partitioned like other content (i.e. document body paragraphs):
- content is partitioned in left-to-right, top-to-bottom order by row
- the resulting elements are not grouped into a single Table element
- in general, each cell will appear as zero-or-more distinct elements (a blank cell contributes zero elements)
- during chunking these elements can be combined with adjacent elements just like those that occur outside a table
- no .metadata.text_as_html field appears for any elements extracted from table content

@LucasOliveira44 one reason I'm thinking that this behavior might be interesting is that tables tend to be big. New elements are never "partially" combined with the prior element to make a chunk. So I think the behavior might end up being largely the same if we simply removed the "never combine a Table element with any other element" rule.
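A rough illustration of the row-by-row variant described in the list above, using plain Python lists rather than anything from the library. The function name `flatten_table` and the sample rows are hypothetical, and the sketch simplifies "zero-or-more elements per cell" to exactly one text per non-blank cell:

```python
def flatten_table(rows: list[list[str]]) -> list[str]:
    """Return non-blank cell texts in left-to-right, top-to-bottom order."""
    return [cell.strip() for row in rows for cell in row if cell.strip()]


# Hypothetical table content, one inner list per row.
rows = [
    ["Name", "Qty"],
    ["Widget", ""],  # the blank cell contributes no element
    ["Gadget", "3"],
]
cell_texts = flatten_table(rows)
```

Each resulting text would then behave like any other body text during chunking, with no Table element and no text_as_html attached.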
Thanks for your reply @scanny. In my case the tables are small, and it would be nice to have the table in the same section as the section title to retain information.
Also, because there isn't a function to create a new element, I copied the table element to hold the table summary. The table summary is a summary of the table generated by an LLM. To have it as an element, I duplicated the table element, replaced element.text with the table summary, and removed the text_as_html and table_as_cells fields from the metadata. This was the best way I found to do this. The advantage is that the parent id is the same, so it is linked to the section title. The disadvantages are that the id is the same as the table element's, because I can't change the id, and that the element is still treated as a table for some reason. If you'd like, I could open an issue for this feature as well.
Both of these elements are treated as tables when chunking. If you know of a way to add the table summary to the same chunk as the section title, rather than to the table, that would fix my issue for now.
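To make the goal of that workaround concrete, here is a minimal sketch of building a separate summary element instead of duplicating the table. The `Element` and `Metadata` classes below are stand-ins written for this example, not the library's real classes, and `summary_element_for` is a hypothetical helper; the point is only that a freshly constructed element gets its own id and a non-table type while keeping the table's parent id:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metadata:
    # Stand-in for the library's metadata object (assumption, not the real class).
    parent_id: Optional[str] = None
    text_as_html: Optional[str] = None


@dataclass
class Element:
    # Stand-in for a document element; `category` plays the role of the element type.
    text: str
    category: str = "NarrativeText"
    metadata: Metadata = field(default_factory=Metadata)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def summary_element_for(table: Element, summary: str) -> Element:
    """Build a fresh text element for a summary, linked to the table's parent."""
    return Element(
        text=summary,
        category="NarrativeText",  # not a table, so table-specific rules would not apply
        metadata=Metadata(parent_id=table.metadata.parent_id),
    )


table = Element(
    "a 1 b 2",
    category="Table",
    metadata=Metadata(parent_id="title-1", text_as_html="<table>...</table>"),
)
summary = summary_element_for(table, "Two rows of sample values.")
```

With the real library, the analogous step would presumably be constructing a Text element directly, which is what the reply that follows suggests.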
@LucasOliveira44 You can convert Table elements to text roughly like this:
from typing import Iterator

from unstructured.chunking.basic import chunk_elements
from unstructured.documents.elements import Element, ElementMetadata, Table, Text

elements = partition_..(...) # -- no chunking_strategy arg, just partitioning

def filter_tables_to_text(elements: list[Element]) -> Iterator[Element]:
    """Yield each element unchanged, except replace every Table with a plain Text."""
    for e in elements:
        if isinstance(e, Table):
            yield Text(
                text=e.text,
                metadata=ElementMetadata(
                    # ... whatever metadata fields you want to carry over ...
                ),
            )
        else:
            yield e

chunks = chunk_elements(filter_tables_to_text(elements))
There are other approaches to the "filtered" metadata. For example, you might just want to start with metadata=e.metadata and see if that will get it done for you. And you can selectively "remove" any given metadata field by assigning None to it, like e.metadata.text_as_html = None. So there might not be a compelling reason to actually construct a new ElementMetadata instance.
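A minimal sketch of that metadata-reuse variant, written against stand-in classes so it runs without the library installed. The `Metadata`, `Element`, `Text`, and `Table` classes here are assumptions made for the example and only mimic the shape of the real ones; `tables_to_text` is a hypothetical name. The copy is an extra precaution of this sketch so the original table element is left untouched:

```python
import copy
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Optional


@dataclass
class Metadata:
    # Stand-in for the library's metadata object (assumption).
    page_number: Optional[int] = None
    parent_id: Optional[str] = None
    text_as_html: Optional[str] = None


@dataclass
class Element:
    text: str
    metadata: Metadata = field(default_factory=Metadata)


class Text(Element):
    """Stand-in for a plain text element."""


class Table(Element):
    """Stand-in for a table element."""


def tables_to_text(elements: Iterable[Element]) -> Iterator[Element]:
    """Replace each Table with a Text that reuses the table's metadata."""
    for e in elements:
        if isinstance(e, Table):
            md = copy.deepcopy(e.metadata)  # leave the original element untouched
            md.text_as_html = None          # "remove" the field by assigning None
            yield Text(text=e.text, metadata=md)
        else:
            yield e


source = [
    Text("Section title"),
    Table("a 1 b 2", Metadata(page_number=3, text_as_html="<table>...</table>")),
]
converted = list(tables_to_text(source))
```

Whether a copy is needed in practice depends on whether the original elements are used again after chunking.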
This of course relies on doing partitioning and chunking as separate steps.
This approach of changing Table elements to another element type is the only approach that's going to work, I believe. Combining tables into the same chunk as non-table elements would just be too disruptive to chunking overall and would, I expect, increase the complexity for most users without tangible benefit.
Why don't we see if we can get a solution like this to work for your use case and then see where we stand.
Is your feature request related to a problem? Please describe.
Currently, when processing PDF documents using the chunk_by_title function from the Unstructured library, a Table element always forms a separate chunk, even if it immediately follows another chunk. This can lead to a fragmented chunk stream in which text chunks alternate with table chunks. That is not ideal for our use case, as I'd like to maintain contiguous text chunks where possible. In my case each chunk holds a section title followed by the text of that section, and I would like tables to be included in the same way to optimize chunking and facilitate retrieval.
Originally posted by @scanny in https://github.com/Unstructured-IO/unstructured/issues/2699#issuecomment-2023677754
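The fragmentation described above can be reproduced with a toy model. This is not the library's chunker: `toy_chunk`, its `isolate_tables` flag, and the `Element` class are all hypothetical, and the sketch ignores title boundaries entirely. It only shows how a "tables always stand alone" rule splits an otherwise contiguous section:

```python
from dataclasses import dataclass


@dataclass
class Element:
    # Stand-in element; `category` plays the role of the element type (assumption).
    text: str
    category: str = "NarrativeText"


def toy_chunk(elements, max_characters=500, isolate_tables=True):
    """Greedily pack element texts into chunks, optionally isolating tables."""
    chunks, current = [], []

    def flush():
        # Close the chunk being built, if any.
        if current:
            chunks.append(" ".join(e.text for e in current))
            current.clear()

    for e in elements:
        if isolate_tables and e.category == "Table":
            flush()                # the table ends the preceding chunk
            chunks.append(e.text)  # and becomes a chunk of its own
            continue
        size = sum(len(x.text) + 1 for x in current) + len(e.text)
        if current and size > max_characters:
            flush()
        current.append(e)
    flush()
    return chunks


section = [
    Element("Results", "Title"),
    Element("Intro text."),
    Element("a 1 b 2", "Table"),
    Element("Closing remark."),
]
isolated = toy_chunk(section)                        # table rule on
combined = toy_chunk(section, isolate_tables=False)  # table rule off
```

With the rule on, the small section yields three chunks (text, table, text); with it off, everything fits in one.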
Describe the solution you'd like
It would be nice to have an option or feature that allows me to control the behavior of chunking when encountering Table elements. Ideally, I would like to be able to specify that a Table element should not break the continuity of text chunks, and instead be included within the same chunk as the preceding text.
Describe alternatives you've considered
I have tried changing the type of the Table element before chunking and removing certain fields from its metadata, such as text_as_html and table_as_cells, to prevent the Table element from forming a separate chunk. However, this approach did not achieve the desired result.
Additional context
This is the current behaviour: