Is your feature request related to a problem? Please describe.
Not really a problem, but it would be very nice to have.
Header and footer elements in particular are rarely useful in RAG systems, and excluding them from the initial response helps reduce unwanted information.
Describe the solution you'd like
Similar to skip_infer_table_types, an exclude_types: List[str] parameter would be intuitive.
Describe alternatives you've considered
With the open source alternative, the same result can be achieved with:
cleaned_elements = [elm for elm in elements if elm.category not in ['Header', 'Footer', 'Image']]
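For reference, here is a minimal, self-contained sketch of the client-side filtering I have in mind. The `Element` dataclass is only a stand-in for the parsed elements, and the `exclude_types` helper is a hypothetical name that mirrors the requested parameter; neither is part of the actual API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Element:
    """Stand-in for a parsed document element (hypothetical, for illustration only)."""
    category: str
    text: str


def exclude_types(elements: Iterable[Element], types: List[str]) -> List[Element]:
    """Return only the elements whose category is not listed in `types`."""
    excluded = set(types)
    return [elm for elm in elements if elm.category not in excluded]


# Sample data standing in for a parsed PDF page.
elements = [
    Element("Header", "ACME Corp"),
    Element("NarrativeText", "Quarterly revenue grew."),
    Element("Footer", "Page 1"),
]

# Drop the element types that are usually noise for RAG.
cleaned_elements = exclude_types(elements, ["Header", "Footer", "Image"])
```

Having this as a request parameter would avoid transferring the unwanted elements in the first place, instead of filtering them after the response arrives.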
Additional context
Unfortunately, the open source alternative does not extract text_as_html (or I wasn't able to get it to), which is why this API is my preference for PDF parsing.