Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat/add optional flag to disable image text extraction under chunking strategy 'hi-res' #3520

Open ajpanyteam opened 1 month ago

ajpanyteam commented 1 month ago

Is your feature request related to a problem? Please describe.
I would like an optional flag for chunking strategy 'hi-res' so that tables are extracted but images are not. When an image contains text, that text is extracted as gibberish, which is hurting our RAG results.

Describe the solution you'd like
An optional flag to switch off image extraction.

Describe alternatives you've considered
I tried partitioning first and then removing the elements I do not want, but the remaining elements still need to be chunked afterwards, and the TS SDK does not support the chunk_by_title function that is available in Python:

from unstructured.chunking.title import chunk_by_title

# `elements` is the list of elements left after partitioning and filtering
chunks = chunk_by_title(elements)
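
The filtering half of that workaround could be sketched as below. This is only an illustration, assuming the partitioned elements are available as serialized dicts with a "type" key in which images are reported as "Image"; the helper name drop_image_elements and the sample data are hypothetical, not part of the library.

```python
from typing import Any


def drop_image_elements(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the elements whose "type" is not "Image", preserving order."""
    return [el for el in elements if el.get("type") != "Image"]


# Hypothetical sample shaped like serialized partition output.
sample = [
    {"type": "Title", "text": "Sample Workflow"},
    {"type": "Image", "text": "Realtime - irk v * Realime"},
    {"type": "Table", "text": "Sensor Rate"},
]

filtered = drop_image_elements(sample)
print([el["type"] for el in filtered])  # ['Title', 'Table']
```

The filter itself ports directly to TypeScript; the missing piece described above is the chunking step that would follow it.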

Additional context
Example of text extracted from an image:

text: « Realtime - irk v * Realime . & ta to Delta rigger = 1 Realtim: ond: o ager =1 second: JSON to Dota (clean up) Watermar oving A rersoes whh — I e L - @8 Azure Data Laj JISON files DBX Autoloader DBX DBSQL Warehouse Structyred Streaming Workflows n Data Apps f A DELTA LAKE DELTA LAKE DELTA LAKE Azure loT Hub T 'Raspberry Pi On site sensors. @400Hz\n + \n + 'Sample Workflow:\n

scanny commented 1 month ago

I think the behavior being requested here is that images are still extracted, just not OCRed, such that any text in the image does not end up in Image.text.
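
Until such a flag exists, that behavior could be approximated after partitioning by keeping the Image elements and blanking their text. A minimal sketch, assuming serialized element dicts with "type" and "text" keys; clear_image_text is a hypothetical helper, not an existing library function, and it does not avoid the cost of running OCR in the first place.

```python
import copy
from typing import Any


def clear_image_text(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a deep copy of the elements with the text of Image elements emptied."""
    cleaned = copy.deepcopy(elements)
    for el in cleaned:
        if el.get("type") == "Image":
            el["text"] = ""
    return cleaned


# Hypothetical sample shaped like serialized partition output.
elements = [
    {"type": "NarrativeText", "text": "Sensors sample at 400Hz."},
    {"type": "Image", "text": "JISON files DBX Autoloader"},
]

cleaned = clear_image_text(elements)
print([(el["type"], el["text"]) for el in cleaned])
```

The input list is left unmodified, so the original OCR output stays available for comparison.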