Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.49k stars 584 forks

feat/Excluding Specific Types #3149

Open tevfikcagridural opened 1 month ago

tevfikcagridural commented 1 month ago

Is your feature request related to a problem? Please describe. Not actually a problem, but very nice to have. Header and footer types in particular are rarely used in RAG systems, and excluding them from the initial response helps reduce unwanted information.

Describe the solution you'd like. An exclude_types: List[str] parameter, similar to the existing skip_infer_table_types, would be intuitive.

Describe alternatives you've considered. With the open source library, we can get the same result by filtering the returned elements:

cleaned_elements = [elm for elm in elements if elm.category not in ['Header', 'Footer', 'Image']]
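For illustration, the filter above could be wrapped in a small helper that takes the list of types to drop. This is only a sketch: `exclude_types` is a hypothetical function name borrowed from the proposed parameter, and the `Element` dataclass is a stand-in that models just a `category` and `text` field so the example runs without the library installed.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Element:
    """Stand-in for a parsed element; only `category` and `text` are modeled."""
    category: str
    text: str


def exclude_types(elements: Iterable[Element], excluded: Iterable[str]) -> List[Element]:
    """Hypothetical helper: keep only elements whose category is not excluded."""
    excluded_set = set(excluded)  # set lookup avoids rescanning the list per element
    return [elm for elm in elements if elm.category not in excluded_set]


# Sample data made up for this sketch.
elements = [
    Element("Header", "ACME Corp"),
    Element("NarrativeText", "Quarterly revenue grew."),
    Element("Footer", "Page 1"),
]

cleaned_elements = exclude_types(elements, ["Header", "Footer", "Image"])
```

Doing this client-side works, but the excluded elements are still returned by the service first, which is the overhead the requested parameter would avoid.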

Additional context. Unfortunately, the open source library does not extract text_as_html (or I wasn't able to get it to), which is why this API is my preference for PDF parsing.