Is your feature request related to a problem? Please describe.
I would like an optional flag for the 'hi-res' strategy so that tables are extracted but images are not. When an image contains text, that text is extracted as gibberish, which is hurting our RAG results.
Describe the solution you'd like
An optional flag to switch off image extraction.
Describe alternatives you've considered
I tried partitioning first and then removing the elements I do not want, but the result still needs to be chunked afterwards. The TS SDK does not support the chunk_by_title function that is available in Python:
from unstructured.chunking.title import chunk_by_title
# `elements` is the list of elements left after removing the unwanted ones
chunks = chunk_by_title(elements)
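For reference, the "pick off the elements I do not want" step of this workaround can be sketched as follows. This is only an illustration under assumptions: it treats each element as a plain dict with "type" and "text" keys, as in the API's JSON output, and `drop_element_types` is a hypothetical helper, not part of any SDK.

```python
from typing import Any

# Assumption: each element is a dict shaped like the API's JSON output,
# with at least "type" and "text" keys.
Element = dict[str, Any]


def drop_element_types(elements: list[Element], unwanted: set[str]) -> list[Element]:
    """Return the elements whose "type" is not in `unwanted`, preserving order."""
    return [el for el in elements if el.get("type") not in unwanted]


# Made-up sample data for illustration only.
sample = [
    {"type": "Title", "text": "Sample Workflow"},
    {"type": "Image", "text": "Realtime - irk v * Realime"},
    {"type": "Table", "text": "Sensor Rate"},
]

filtered = drop_element_types(sample, {"Image"})
print([el["type"] for el in filtered])
```

This only covers the filtering; the filtered elements still need to be chunked, which is the part the TS SDK cannot do today.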
Additional context
Example of image text extracted:
text: « Realtime - irk v * Realime . & ta to Delta rigger = 1 Realtim: ond: o ager =1 second: JSON to Dota (clean up) Watermar oving A rersoes whh — I e L - @8 Azure Data Laj JISON files DBX Autoloader DBX DBSQL Warehouse Structyred Streaming Workflows n Data Apps f A DELTA LAKE DELTA LAKE DELTA LAKE Azure loT Hub T 'Raspberry Pi On site sensors. @400Hz\n + \n + 'Sample Workflow:\n
I think the behavior being requested here is that images are still extracted, just not OCRed, such that any text in the image does not end up in Image.text.
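To make that distinction concrete, here is a rough client-side sketch of the end result: Image elements stay in the output, but their text is emptied. It assumes elements are plain dicts with "type" and "text" keys, and `blank_image_text` is a hypothetical helper for illustration, not an existing option or API.

```python
from typing import Any


def blank_image_text(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return shallow copies of the elements, with "text" emptied on Image elements."""
    return [
        {**el, "text": ""} if el.get("type") == "Image" else dict(el)
        for el in elements
    ]


# Made-up sample data for illustration only.
sample = [
    {"type": "Title", "text": "Sample Workflow"},
    {"type": "Image", "text": "Realtime - irk v * Realime"},
]

cleaned = blank_image_text(sample)
print(cleaned)
```

Unlike dropping the elements entirely, this keeps the image's position in the document, so downstream code can still see that an image was there.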