Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat/ Param to control the behavior of chunking when encountering Table #2990

Open LucasOliveira44 opened 6 months ago

LucasOliveira44 commented 6 months ago

Is your feature request related to a problem? Please describe.

Currently, when processing PDF documents with the chunk_by_title function from the Unstructured library, a Table element always forms a separate chunk, even if it immediately follows another chunk. This can lead to a fragmented chunk stream in which text chunks alternate with table chunks. That is not ideal for our use case, as I'd like to keep chunks contiguous where possible. In my case each chunk holds a section title followed by the text of that section, and I would like tables to be included in the same way, to optimize chunking and facilitate retrieval.

@georearl A Table element is never combined with any other element (even another Table) to form a chunk. So where a Table appears in the element stream, the prior chunk will close, the table chunk will appear, and a new chunk will start after the table.

This is what we're seeing in the chunk stream; chunk, table, chunk, table, etc. Note that Table is a chunk-type, along with CompositeElement and TableChunk. So you could think of this as text-chunk, table-chunk, text-chunk, table-chunk, ...

Closing for now as no bug is evident here but feel free to ask any other questions about this if you need more clarification :)

Originally posted by @scanny in https://github.com/Unstructured-IO/unstructured/issues/2699#issuecomment-2023677754
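A minimal sketch of the rule quoted above, written as a toy model and not as the library's implementation. `Item` and `toy_chunk` are hypothetical names, and the model ignores everything else chunk_by_title does (section boundaries, overlap, oversized-element splitting); it only shows why the stream comes out as text-chunk, table-chunk, text-chunk, table-chunk.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a partitioned element."""

    kind: str  # e.g. "Title", "NarrativeText", "Table"
    text: str


def toy_chunk(items, max_characters=500):
    """Group items into chunks; a Table closes the open chunk and stands alone."""
    chunks, current, size = [], [], 0
    for item in items:
        is_table = item.kind == "Table"
        too_big = size + len(item.text) > max_characters
        # Close the open text chunk before a table, or when it would overflow.
        if current and (is_table or too_big):
            chunks.append(current)
            current, size = [], 0
        if is_table:
            chunks.append([item])  # a table is never merged with its neighbors
        else:
            current.append(item)
            size += len(item.text)
    if current:
        chunks.append(current)
    return chunks


stream = [
    Item("Title", "Results"),
    Item("NarrativeText", "Summary of the results."),
    Item("Table", "a b c"),
    Item("NarrativeText", "Discussion."),
    Item("Table", "d e f"),
]
kinds = [[item.kind for item in chunk] for chunk in toy_chunk(stream)]
print(kinds)
# [['Title', 'NarrativeText'], ['Table'], ['NarrativeText'], ['Table']]
```

Dropping the `is_table` branch in this toy model is the equivalent of the option requested in this issue: the table would simply join the open chunk if it fits.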

Describe the solution you'd like

It would be nice to have an option or feature that allows me to control the behavior of chunking when encountering Table elements. Ideally, I would like to be able to specify that a Table element should not break the continuity of text chunks, and instead be included within the same chunk as the preceding text.

Describe alternatives you've considered

I have attempted to modify the type of the Table element before chunking and removed certain fields from its metadata, such as text_as_html and table_as_cells, in an attempt to prevent the Table element from forming a separate chunk. However, this approach did not achieve the desired result.

Additional context

This is the current behaviour:

for chu in chunks: 
    print(len(chu.metadata.orig_elements))
    for el in chu.metadata.orig_elements:
        print(el.category)

[Image: Screenshot 2024-05-08 at 23 29 02]
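The same inspection can be collected into a list so it is easier to compare runs. This is a small sketch that only assumes each chunk exposes `.metadata.orig_elements` and each element exposes `.category`, as in the loop above; `chunk_categories` is a hypothetical helper name, and the demo uses stand-in objects in place of real chunks.

```python
from types import SimpleNamespace


def chunk_categories(chunks):
    """Return, for each chunk, the categories of the elements it was built from."""
    result = []
    for chunk in chunks:
        # orig_elements may be missing or None, for instance when the chunker
        # was asked not to record them; treat that as "no elements known".
        orig = getattr(chunk.metadata, "orig_elements", None) or []
        result.append([el.category for el in orig])
    return result


def _el(category):
    """Stand-in element carrying only a category (demo only)."""
    return SimpleNamespace(category=category)


demo = [
    SimpleNamespace(metadata=SimpleNamespace(orig_elements=[_el("Title"), _el("NarrativeText")])),
    SimpleNamespace(metadata=SimpleNamespace(orig_elements=[_el("Table")])),
    SimpleNamespace(metadata=SimpleNamespace(orig_elements=None)),
]
print(chunk_categories(demo))
# [['Title', 'NarrativeText'], ['Table'], []]
```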

cragwolfe commented 6 months ago

@LucasOliveira44, thanks for submitting this issue.

> It would be nice to have an option or feature that allows me to control the behavior of chunking when encountering Table elements. Ideally, I would like to be able to specify that a Table element should not break the continuity of text chunks, and instead be included within the same chunk as the preceding text.

As text, html, or other (like markdown)?

scanny commented 6 months ago

One other possible behavior variation that occurred to me, not sure whether it's useful or not:

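When option {x} has value {y}, content within tables is partitioned like other content (i.e. document body paragraphs):

  • content is partitioned in left-to-right, top-to-bottom order by row
  • the resulting elements are not grouped into a single Table element
  • in general, each cell will appear as zero-or-more distinct elements (blank cell contributes zero elements)
  • during chunking these elements can be combined with adjacent elements just like those that occur outside a table.
  • no .metadata.text_as_html field appears for any elements extracted from table content.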

@LucasOliveira44 one reason I'm thinking that this behavior might be interesting is that tables tend to be big. New elements are never "partially" combined with the prior element to make a chunk. So I think the behavior might end up being largely the same if we simply removed the "never combine a Table element with any other element" rule.
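A rough sketch of what the proposed option would do to a table's content: walk the cells row by row, left to right, and emit zero or more plain text items per cell in place of one Table element. This is an illustration only; `flatten_table` is a hypothetical name, the table is modeled as a list of rows of strings, and splitting a cell on newlines is an assumption about how a cell might yield more than one element.

```python
def flatten_table(rows):
    """Yield cell text in top-to-bottom, left-to-right order, skipping blanks."""
    for row in rows:
        for cell in row:
            # A cell may hold several lines; each non-blank line becomes one item.
            for line in cell.splitlines():
                text = line.strip()
                if text:
                    yield text


rows = [
    ["Name", "Qty"],
    ["Widget", ""],          # blank cell contributes nothing
    ["Gadget\nblue", "3"],   # multi-line cell contributes two items
]
print(list(flatten_table(rows)))
# ['Name', 'Qty', 'Widget', 'Gadget', 'blue', '3']
```

Items produced this way carry no table structure, which is why no text_as_html would be available for them.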

LucasOliveira44 commented 6 months ago

> @LucasOliveira44, thanks for submitting this issue.
>
> > It would be nice to have an option or feature that allows me to control the behavior of chunking when encountering Table elements. Ideally, I would like to be able to specify that a Table element should not break the continuity of text chunks, and instead be included within the same chunk as the preceding text.
>
> As text, html, or other (like markdown)?

I'm not sure I understand what you mean, but I mean the whole Table element. In my case I am only interested in the text, because text_as_html crops the last row of the table, as mentioned in https://github.com/Unstructured-IO/unstructured/issues/2478#issue-2108473946, so I have given up on using text_as_html.

LucasOliveira44 commented 6 months ago

> One other possible behavior variation that occurred to me, not sure whether it's useful or not:
>
> When option {x} has value {y}, content within tables is partitioned like other content (i.e. document body paragraphs):
>
>   • content is partitioned in left-to-right, top-to-bottom order by row
>   • the resulting elements are not grouped into a single Table element
>   • in general, each cell will appear as zero-or-more distinct elements (blank cell contributes zero elements)
>   • during chunking these elements can be combined with adjacent elements just like those that occur outside a table.
>   • no .metadata.text_as_html field appears for any elements extracted from table content.
>
> @LucasOliveira44 one reason I'm thinking that this behavior might be interesting is that tables tend to be big. New elements are never "partially" combined with the prior element to make a chunk. So I think the behavior might end up being largely the same if we simply removed the "never combine a Table element with any other element" rule.

Thanks for your reply @scanny. In my case the tables are small, and it would be nice to have the table in the same chunk as its section title to retain information.

Also, because there isn't a function to create a new element, I copied the table element to hold the table summary. The table summary is a summary of the table generated by an LLM. To have it as an element, I duplicated the table element, replaced element.text with the table summary, and removed the text_as_html and table_as_cells fields from the metadata. This was the best way I found to do this. The advantage is that the parent id is the same, so the summary is linked to the section title. The disadvantages are that the id is the same as the table element's, because I can't change the id, and that the element is still treated as a table for some reason. If you'd like, I could open an issue for this feature as well.

Both of these elements are treated as tables when chunking. If you know of a way to add the table summary to the same chunk as the section title, rather than to the table chunk, that would fix my issue for now.

scanny commented 6 months ago

@LucasOliveira44 You can convert Table elements to text roughly like this:

from typing import Iterator

from unstructured.chunking.basic import chunk_elements
from unstructured.documents.elements import Element, ElementMetadata, Table, Text

elements = partition_..(...)  # -- no chunking_strategy arg, just partitioning


def filter_tables_to_text(elements: list[Element]) -> Iterator[Element]:
    """Yield each element unchanged, except Table elements, which become Text."""
    for e in elements:
        if isinstance(e, Table):
            yield Text(
                text=e.text,
                metadata=ElementMetadata(
                    # ... whatever metadata fields you want to carry over ...
                ),
            )
        else:
            yield e


# -- the chunker no longer sees any Table elements, so nothing is isolated --
chunks = chunk_elements(filter_tables_to_text(elements))

There are other approaches to the "filtered" metadata. For example, you might just want to start with metadata=e.metadata and see if that will get it done for you. And you can selectively "remove" any given metadata field by assigning None to it, like e.metadata.text_as_html = None. So there might not be a compelling reason to actually construct a new ElementMetadata instance.

This of course relies on doing partitioning and chunking as separate steps.
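The "reuse the metadata and blank out selected fields" variant might look roughly like the sketch below. It is written against duck-typed objects so it can be tried without the library: `tables_as_text`, `make_text`, and `is_table` are hypothetical names, and the demo uses stand-in objects. With the real library the call would presumably be `tables_as_text(elements, Text, lambda e: isinstance(e, Table))`; that usage is an assumption and has not been run here.

```python
import copy
from types import SimpleNamespace


def tables_as_text(elements, make_text, is_table,
                   drop=("text_as_html", "table_as_cells")):
    """Yield elements unchanged, replacing each table with a text element."""
    for e in elements:
        if not is_table(e):
            yield e
            continue
        # Copy first so the original table element keeps its metadata.
        metadata = copy.deepcopy(e.metadata)
        for name in drop:
            setattr(metadata, name, None)  # "remove" the field by assigning None
        yield make_text(text=e.text, metadata=metadata)


# -- stand-in objects for the demo; real code would pass partitioned elements --
table = SimpleNamespace(
    category="Table",
    text="a b c",
    metadata=SimpleNamespace(parent_id="sec-1", text_as_html="<table>...</table>"),
)
para = SimpleNamespace(
    category="NarrativeText",
    text="Intro.",
    metadata=SimpleNamespace(parent_id="sec-1"),
)


def make_text(text, metadata):
    """Stand-in constructor playing the role of a text element class."""
    return SimpleNamespace(category="UncategorizedText", text=text, metadata=metadata)


out = list(tables_as_text([para, table], make_text, lambda e: e.category == "Table"))
print([e.category for e in out])
# ['NarrativeText', 'UncategorizedText']
```

Copying the metadata is a design choice: it keeps parent_id and the other fields on the new element while leaving the partitioner's original Table untouched in case it is needed later.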

scanny commented 6 months ago

This approach of changing Table elements to another element type is, I believe, the only approach that's going to work. Combining tables into the same chunk as non-table elements would just be too disruptive to chunking overall and would, I expect, increase the complexity for most users without tangible benefit.

Why don't we see if we can get a solution like this to work for your use case and then see where we stand.