Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat/skip ocr for certain element types #3163

Closed: beez2022 closed this issue 1 month ago

beez2022 commented 3 months ago

Some element types (such as images or pictures) may need to pass through a custom classifier model that determines whether they need OCR.

An argument that lists the element types to exclude from OCR would let us skip OCR for those types. This feature would probably apply only to PDFs and images, since docx, pptx, etc. do not capture element types such as Image.
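To make the proposal concrete, here is a minimal sketch of the idea, not unstructured's actual API. The `Region` class, the `ocr_regions` function, and the `skip_ocr_element_types` argument are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Region:
    """Hypothetical stand-in for a detected layout region (not an unstructured class)."""
    element_type: str           # e.g. "Image", "Picture", "Title"
    text: Optional[str] = None  # filled in only when OCR runs on this region


def ocr_regions(
    regions: Iterable[Region],
    run_ocr: Callable[[Region], str],
    skip_ocr_element_types: Iterable[str] = (),
) -> List[Region]:
    """Run OCR on every region except those whose type is in the skip list."""
    # Compare case-insensitively so "Picture" and "picture" are treated alike.
    skip = {t.lower() for t in skip_ocr_element_types}
    result = []
    for region in regions:
        if region.element_type.lower() not in skip:
            region.text = run_ocr(region)
        result.append(region)
    return result


# Toy usage: the lambda is a placeholder for a real OCR call.
processed = ocr_regions(
    [Region("Title"), Region("Image"), Region("Picture")],
    run_ocr=lambda region: "recognized text",
    skip_ocr_element_types=["Image", "Picture"],
)
```

Skipped regions are still returned, with `text` left as `None`, so a downstream step can decide what to do with them.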

scanny commented 3 months ago

@beez2022 Can you provide an example document? And which file formats is this an issue for?

In general, images and pictures are the same thing and would be partitioned as an Image element.

Could you accomplish what you want with the existing capability by not OCR-ing at all? Only images would be OCR-ed.

beez2022 commented 3 months ago

Good morning @scanny. I intend to make a contribution for this issue; per the contribution guidelines, I raised the issue first. The affected file formats are PDF and images (.jpg, .png). I realised that they ultimately call the same function for OCR. We have a requirement to classify images extracted from a PDF document before deciding whether they need OCR. We might also use another OCR tool instead of the one that unstructured currently provides. That is why I thought the ability to turn off OCR for certain element types would be helpful. I made the distinction between Picture and Image because I realised that a .png file goes through unstructured_models, which outputs "image" elements as "picture". Lastly, you mentioned that "only images would be ocr-ed". So currently, is there an option to turn off OCR of images that are embedded in PDFs? Thank you.
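The workflow described above (classify each extracted image first, then OCR only the ones that need it, possibly with a different OCR tool) could look roughly like the sketch below. Everything here is an assumption for illustration: `ocr_after_classification`, `needs_ocr`, and `ocr_engine` are hypothetical names, and the lambdas are placeholders for a real classifier and OCR tool.

```python
from typing import Callable, Dict


def ocr_after_classification(
    images: Dict[str, bytes],
    needs_ocr: Callable[[bytes], bool],
    ocr_engine: Callable[[bytes], str],
) -> Dict[str, str]:
    """Return OCR text only for images the classifier flags; others are left out."""
    return {
        name: ocr_engine(data)
        for name, data in images.items()
        if needs_ocr(data)
    }


# Toy stand-ins so the sketch runs end to end.
images = {"scan.png": b"text-bearing", "logo.png": b""}
texts = ocr_after_classification(
    images,
    needs_ocr=lambda data: len(data) > 0,          # placeholder classifier
    ocr_engine=lambda data: data.decode("ascii"),  # placeholder OCR tool
)
```

Passing the classifier and the OCR engine as callables keeps both swappable, which matches the requirement to use a different OCR tool.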

christinestraub commented 3 months ago

@beez2022 As of now, there is no option to turn off OCR of images that are embedded in PDFs.

beez2022 commented 3 months ago

Thanks @christinestraub @scanny