Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Chunking Ignores Images #2674

Closed: asakinory closed this issue 6 months ago

asakinory commented 7 months ago

Description:

The current chunking strategies in the Unstructured library do not handle images properly. When encountering an image element, the chunker simply discards it, resulting in the loss of image data. This is a significant limitation for documents that contain important images.

Steps to Reproduce:

1. Partition a document that contains images.
2. Apply a chunking strategy (e.g., "basic" or "by_title") to the partitioned elements.
3. Observe that the images are missing from the resulting chunks.

Expected Behavior:

Chunking strategies should preserve images and incorporate them into chunks in a meaningful way. This could involve:

- Chunking based on image size.
- Grouping images with related text elements.
- Creating special chunk types for images.
- Providing configuration options for users to specify how they want images to be handled.

Workarounds:

Currently, the following workarounds can be used:

- Pre-process documents to extract images before chunking.
- Post-process chunks to re-insert images based on metadata (see below).

However, these workarounds are not ideal and require additional effort.

Request:

Implement a solution to preserve images during chunking. Otherwise, give a warning that images are discarded when trying to chunk an Image element. It's also possible to make this optional with a new parameter in all chunking methods.

Additional Information:

Here is my current workaround, which can also be used to create a test for this:

from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title


def workaround_chunk(
    elements,
    max_characters=4000,
    multipage_sections=True,
    new_after_n_chars=None,
    combine_text_under_n_chars=3000,
    overlap=0,
    overlap_all=False,
    isolate_chunk_types=("Image", "Table"),  # tuple avoids a mutable default
    base_method="chunk_by_title",
):
    """Chunk runs of text normally, passing isolated element types through untouched."""

    def chunk_pending(pending):
        # Nothing accumulated since the last isolated element.
        if not pending:
            return []
        if base_method == "basic":
            # combine_text_under_n_chars is a by_title-only option, so it is not passed here.
            return chunk_elements(
                elements=pending,
                max_characters=max_characters,
                new_after_n_chars=new_after_n_chars,
                overlap=overlap,
                overlap_all=overlap_all,
            )
        if base_method == "chunk_by_title":
            return chunk_by_title(
                elements=pending,
                max_characters=max_characters,
                combine_text_under_n_chars=combine_text_under_n_chars,
                new_after_n_chars=new_after_n_chars,
                overlap=overlap,
                overlap_all=overlap_all,
                multipage_sections=multipage_sections,
            )
        raise ValueError(f"unsupported base_method: {base_method!r}")

    chunks = []
    pending = []
    for element in elements:
        if element.to_dict()["type"] in isolate_chunk_types:
            # Flush the text collected so far, then keep the isolated element as-is.
            chunks.extend(chunk_pending(pending))
            pending = []
            chunks.append(element)
        else:
            pending.append(element)

    # Chunk whatever text follows the last isolated element.
    chunks.extend(chunk_pending(pending))
    return chunks

A possible solution, I think, is to create new ImagePreChunk and ImageChunk classes similar to TablePreChunk and TableChunk. Implementing it this way will also cover the rare cases of images with long texts within them.
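As for the warning requested above, here is a minimal sketch of what it could look like from the caller's side. `warn_if_images_present` is a hypothetical helper, not part of unstructured, and the `Image` and `NarrativeText` classes below are stand-ins so the snippet runs on its own. Elements are matched by class name only.

```python
import warnings


def warn_if_images_present(elements):
    """Warn once if any element's class is named "Image"; return the count.

    Hypothetical helper for illustration, not an unstructured API.
    """
    image_count = sum(1 for e in elements if type(e).__name__ == "Image")
    if image_count:
        warnings.warn(
            f"{image_count} Image element(s) may be discarded by chunking",
            UserWarning,
            stacklevel=2,
        )
    return image_count


# Stand-in element classes (assumptions) so the sketch runs without unstructured.
class Image:
    pass


class NarrativeText:
    pass


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    found = warn_if_images_present([NarrativeText(), Image(), Image()])

print(found, len(caught))
```

Calling something like this just before the chunker would at least make the data loss visible without changing chunking behavior.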

scanny commented 6 months ago

@asakinory recently the .metadata.orig_elements field was added to metadata for chunks. This perhaps allows more convenient post-processing of chunks, so you can process them in whatever way makes sense for your use-case.

For example, you can retrieve whatever Image elements were incorporated into a chunk with the expression:

chunk_images = [e for e in chunk.metadata.orig_elements if type(e).__name__ == "Image"]

Any text extracted from the image using OCR will appear in the chunk.text in element-stream order (document order generally speaking).
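To apply that expression across a whole list of chunks, something along these lines might work. This is only a sketch: `images_by_chunk` and `make_chunk` are hypothetical names, and the chunk and element objects are simple stand-ins that mimic the `.metadata.orig_elements` shape, so it runs without unstructured installed.

```python
from types import SimpleNamespace


def images_by_chunk(chunks):
    """Map each chunk's index to the Image elements found in its orig_elements."""
    result = {}
    for index, chunk in enumerate(chunks):
        # orig_elements may be absent or None, so fall back to an empty list.
        originals = getattr(chunk.metadata, "orig_elements", None) or []
        images = [e for e in originals if type(e).__name__ == "Image"]
        if images:
            result[index] = images
    return result


# Stand-ins (assumptions) that only mimic the attributes used above.
class Image:
    def __init__(self, text=""):
        self.text = text


class NarrativeText:
    def __init__(self, text=""):
        self.text = text


def make_chunk(text, orig_elements):
    metadata = SimpleNamespace(orig_elements=orig_elements)
    return SimpleNamespace(text=text, metadata=metadata)


figure = Image("Figure 1")
chunks = [
    make_chunk("Intro", [NarrativeText("Intro")]),
    make_chunk("Body Figure 1", [NarrativeText("Body"), figure]),
]
found = images_by_chunk(chunks)
print(sorted(found))
```

With real chunks you would pass the list returned by the chunking function instead of the stand-ins.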

Let us know how you go with that and we can consider further steps if necessary :)