Closed asakinory closed 6 months ago
@asakinory the .metadata.orig_elements field was recently added to the metadata for chunks. It lets you post-process chunks in whatever way makes sense for your use case.
For example, you can retrieve whatever Image elements were incorporated into a chunk with the expression:
chunk_images = [e for e in chunk.metadata.orig_elements if type(e).__name__ == "Image"]
Any text extracted from the image using OCR will appear in chunk.text in element-stream order (generally speaking, document order).
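As a slightly more defensive version of the expression above, here is a minimal sketch of a helper that returns an empty list when a chunk carries no original elements. It assumes the field lives at chunk.metadata.orig_elements as described above. The helper name images_in_chunk and the stand-in classes in the demo are hypothetical and only mimic the shape of real chunks so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def images_in_chunk(chunk):
    """Return the Image elements that were incorporated into `chunk`.

    Hypothetical helper: assumes `chunk.metadata.orig_elements` holds the
    original elements, and treats a missing or None value as "no elements".
    """
    metadata = getattr(chunk, "metadata", None)
    orig_elements = getattr(metadata, "orig_elements", None) or []
    return [e for e in orig_elements if type(e).__name__ == "Image"]


# --- Demo with stand-in classes (NOT the real library classes) ---
class Image:
    def __init__(self, text):
        self.text = text


class NarrativeText:
    def __init__(self, text):
        self.text = text


chunk = SimpleNamespace(
    text="Intro paragraph.\n\ndiagram caption",
    metadata=SimpleNamespace(
        orig_elements=[NarrativeText("Intro paragraph."), Image("diagram caption")]
    ),
)
empty_chunk = SimpleNamespace(text="", metadata=SimpleNamespace(orig_elements=None))

found = images_in_chunk(chunk)
print([img.text for img in found])
print(images_in_chunk(empty_chunk))
```

With real chunks, the same function can be called in a loop over the chunker's output to collect the images belonging to each chunk.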
Let us know how you go with that and we can consider further steps if necessary :)
Description:
The current chunking strategies in the Unstructured library do not handle images properly. When encountering an image element, the chunker simply discards it, resulting in the loss of image data. This is a significant limitation for documents that contain important images.
Steps to Reproduce:
1. Partition a document that contains images.
2. Apply a chunking strategy (e.g., "basic" or "by_title") to the partitioned elements.
3. Observe that the images are missing from the resulting chunks.
Expected Behavior:
Chunking strategies should preserve images and incorporate them into chunks in a meaningful way. This could involve:
- Chunking based on image size.
- Grouping images with related text elements.
- Creating special chunk types for images.
- Providing configuration options for users to specify how they want images to be handled.
Workarounds:
Currently, the following workarounds can be used:
- Pre-process documents to extract images before chunking.
- Post-process chunks to re-insert images based on metadata (see below).
However, these workarounds are not ideal and require additional effort.
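The pre-processing workaround could look roughly like the sketch below. This is an illustration under stated assumptions, not the library's API: the function name split_out_images is hypothetical, Image elements are recognized by class name only, and the stand-in classes in the demo exist just to make the snippet runnable on its own.

```python
def split_out_images(elements):
    """Separate Image elements from the rest before chunking.

    Hypothetical helper. Returns (others, images), where `images` is a list
    of (position, element) pairs so each image can later be re-inserted
    near its original place in the element stream.
    """
    others, images = [], []
    for position, element in enumerate(elements):
        if type(element).__name__ == "Image":
            images.append((position, element))
        else:
            others.append(element)
    return others, images


# --- Demo with stand-in classes (NOT the real library classes) ---
class Image:
    def __init__(self, text):
        self.text = text


class Title:
    def __init__(self, text):
        self.text = text


class NarrativeText:
    def __init__(self, text):
        self.text = text


elements = [
    Title("Results"),
    Image("figure 1"),
    NarrativeText("Figure 1 shows the trend."),
    Image("figure 2"),
]

others, images = split_out_images(elements)
print([e.text for e in others])
print([(pos, img.text) for pos, img in images])
```

The non-image elements would then go to the chunker as usual, while the recorded positions give the post-processing step something to match against when putting images back.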
Request:
Implement a solution to preserve images during chunking. Failing that, emit a warning that images are discarded whenever the chunker encounters an Image element. This behavior could also be made optional through a new parameter on all chunking methods.
Additional Information:
Here is my current workaround, which can also be used to create a test for this:
A possible solution, I think, is to create new ImagePreChunk and ImageChunk classes similar to TablePreChunk and TableChunk. Implementing it this way would also cover the rare cases of images that contain long text.