Chunking Ignores Images

Description:

The current chunking strategies in the Unstructured library do not handle images properly. When encountering an image element, the chunker simply discards it, resulting in the loss of image data. This is a significant limitation for documents that contain important images.

Steps to Reproduce:

1. Partition a document that contains images.
2. Apply a chunking strategy (e.g., "basic" or "by_title") to the partitioned elements.
3. Observe that the images are missing from the resulting chunks.

Expected Behavior:

Chunking strategies should preserve images and incorporate them into chunks in a meaningful way. This could involve:

- Chunking based on image size.
- Grouping images with related text elements.
- Creating special chunk types for images.
- Providing configuration options for users to specify how they want images to be handled.
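To make one of these options concrete, grouping images with related text could look roughly like the sketch below. It is only an illustration: `Element` and `Group` are stand-in classes invented here, not the library's types, and "start a new group at each Title" is an assumed grouping rule.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Stand-in for a partitioned element; not the library's class."""
    type: str
    text: str = ""


@dataclass
class Group:
    """Hypothetical chunk that keeps images next to the text they follow."""
    texts: list = field(default_factory=list)
    images: list = field(default_factory=list)


def group_images_with_text(elements):
    """Start a new group at each Title; attach every Image to the current group."""
    groups = []
    current = None
    for el in elements:
        if current is None or el.type == "Title":
            current = Group()
            groups.append(current)
        if el.type == "Image":
            current.images.append(el)
        else:
            current.texts.append(el.text)
    return groups


elements = [
    Element("Title", "Results"),
    Element("NarrativeText", "Figure 1 shows the trend."),
    Element("Image", "figure-1"),
    Element("Title", "Discussion"),
    Element("NarrativeText", "No figures here."),
]
groups = group_images_with_text(elements)
print([(g.texts, len(g.images)) for g in groups])
```

With this rule the image stays in the "Results" group instead of being dropped, and the "Discussion" group simply has no images.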

Workarounds:

Currently, the following workarounds can be used:

- Pre-process documents to extract images before chunking.
- Post-process chunks to re-insert images based on metadata (see below).

However, these workarounds are not ideal and require additional effort.

Request:

Implement a solution that preserves images during chunking. Failing that, emit a warning that images are discarded whenever an Image element is chunked. The behavior could also be made optional through a new parameter on all chunking methods.

Additional Information:

Here is my current workaround, which can also be used to create a test for this:

from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title


def workaround_chunk(
    elements,
    max_characters=4000,
    multipage_sections=True,
    new_after_n_chars=None,
    combine_text_under_n_chars=3000,
    overlap=0,
    overlap_all=False,
    isolate_chunk_types=("Image", "Table"),  # tuple avoids a mutable default
    base_method="chunk_by_title",
):
    """Chunk elements, passing isolated types (e.g. Image) through unchanged."""
    if base_method not in ("basic", "chunk_by_title"):
        raise ValueError(f"Unsupported base_method: {base_method!r}")

    def chunk_pending(pending):
        # Chunk the run of non-isolated elements collected so far.
        if not pending:
            return []
        if base_method == "basic":
            # combine_text_under_n_chars is a by_title option, so it is not passed here.
            return chunk_elements(
                elements=pending,
                max_characters=max_characters,
                new_after_n_chars=new_after_n_chars,
                overlap=overlap,
                overlap_all=overlap_all,
            )
        return chunk_by_title(
            elements=pending,
            max_characters=max_characters,
            combine_text_under_n_chars=combine_text_under_n_chars,
            new_after_n_chars=new_after_n_chars,
            overlap=overlap,
            overlap_all=overlap_all,
            multipage_sections=multipage_sections,
        )

    chunks = []  # flat list of chunks and isolated elements, in document order
    pending = []
    for element in elements:
        if element.to_dict()["type"] in isolate_chunk_types:
            # Flush the text collected before this element, then keep it as-is.
            chunks.extend(chunk_pending(pending))
            pending = []
            chunks.append(element)
        else:
            pending.append(element)

    # Chunk whatever follows the last isolated element.
    chunks.extend(chunk_pending(pending))
    return chunks
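A regression test built on the workaround could check that no isolated element goes missing. The sketch below shows one way to express that check without the library installed: `FakeElement` and `lost_elements` are hypothetical names, and the check assumes the chunker passes isolated elements through as the same objects, which is what the workaround does.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Stand-in exposing only to_dict(), which the workaround relies on."""
    type: str
    text: str = ""

    def to_dict(self):
        return {"type": self.type, "text": self.text}


def lost_elements(elements, chunks, isolate_chunk_types=("Image", "Table")):
    """Return input elements of the isolated types that are absent from chunks."""
    kept = {id(c) for c in chunks}  # identity check: same object passed through
    return [
        el for el in elements
        if el.to_dict()["type"] in isolate_chunk_types and id(el) not in kept
    ]


image = FakeElement("Image", "figure-1")
text = FakeElement("NarrativeText", "Some text.")

# A chunker that drops the image is flagged; one that passes it through is not.
print(lost_elements([text, image], chunks=[text]))
print(lost_elements([text, image], chunks=[text, image]))
```

In a real test, `chunks` would be the result of `workaround_chunk(elements)` on partitioned elements, and the assertion would be that `lost_elements(elements, chunks)` is empty.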

One possible solution would be to add new ImagePreChunk and ImageChunk classes, similar to TablePreChunk and TableChunk. Implementing it this way would also cover the rare case of images that contain long text.
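As a rough illustration of that idea, the sketch below shows how an ImagePreChunk might split an image's text into ImageChunk objects. Everything here is hypothetical: the fields, the `iter_chunks` method name, and the choice to attach the image data only to the first chunk are assumptions made for this example, and it does not mirror how the existing table classes are implemented.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class ImageChunk:
    """Hypothetical chunk type: one slice of an image's text plus the image data."""
    text: str
    image_base64: Optional[str] = None
    is_continuation: bool = False


@dataclass
class ImagePreChunk:
    """Hypothetical pre-chunk wrapping a single image element."""
    text: str
    image_base64: Optional[str]
    max_characters: int  # assumed to be a positive integer

    def iter_chunks(self) -> Iterator[ImageChunk]:
        # An image with short (or no) text becomes exactly one chunk.
        if len(self.text) <= self.max_characters:
            yield ImageChunk(self.text, self.image_base64)
            return
        # Long text is split into windows; only the first carries the image data.
        for start in range(0, len(self.text), self.max_characters):
            yield ImageChunk(
                text=self.text[start:start + self.max_characters],
                image_base64=self.image_base64 if start == 0 else None,
                is_continuation=start > 0,
            )


chunks = list(ImagePreChunk("x" * 25, "aGVsbG8=", max_characters=10).iter_chunks())
print([(len(c.text), c.is_continuation) for c in chunks])
```

Carrying the image data on the first chunk only avoids duplicating a potentially large payload, but repeating it on every chunk would be an equally defensible choice.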
