Extract passage headers during processing of PDF documents

deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.

https://haystack.deepset.ai

Apache License 2.0


Extract passage headers during processing of PDF documents #482

Closed. Timoeller closed 1 year ago

Timoeller commented 3 years ago

What to do

When converting PDF documents to text with either Apache Tika or pdftotext, we have some functionality to split the documents into passages afterwards. It would be beneficial to have per-passage meta information containing the title of the PDF document and the (sub)header of the corresponding passage.
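To make the request concrete, here is a minimal sketch of what per-passage meta could look like. It assumes the converter already yields (header, body) pairs per section, which is exactly the part that is still missing. The names `Passage` and `split_with_headers` are hypothetical and not part of Haystack.

```python
from dataclasses import dataclass, field


@dataclass
class Passage:
    """A passage of text plus the meta information requested in this issue."""
    text: str
    meta: dict = field(default_factory=dict)


def split_with_headers(title, sections):
    """Split each section body on blank lines and attach title/header meta.

    `sections` is assumed to be a list of (header, body) pairs. Producing
    those pairs from a PDF is the open question of this issue.
    """
    passages = []
    for header, body in sections:
        for chunk in body.split("\n\n"):
            chunk = chunk.strip()
            if chunk:  # skip empty chunks left over from extra blank lines
                passages.append(
                    Passage(text=chunk, meta={"title": title, "header": header})
                )
    return passages
```

Each resulting passage carries the document title and the header of the section it came from, so downstream components can use both.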

Use-case

This information might be very useful for DPR-based retrieval or for more rule-based retrieval. Rule-based retrieval on (sub)header information could be relevant if we can rely heavily on the headers containing specific keywords.
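As a rough illustration of the rule-based case, the sketch below keeps only passages whose header contains one of a set of keywords. It assumes passages are plain dicts with a `meta["header"]` entry; `filter_by_header` is a hypothetical helper, not an existing Haystack function.

```python
def filter_by_header(passages, keywords):
    """Return the passages whose header contains any keyword (case-insensitive).

    Passages without a header (missing or None) never match.
    """
    wanted = [k.lower() for k in keywords]
    selected = []
    for passage in passages:
        header = (passage.get("meta", {}).get("header") or "").lower()
        if any(k in header for k in wanted):
            selected.append(passage)
    return selected
```

A filter like this could run before or after a dense retriever to narrow the candidates to sections with a relevant heading.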

Timoeller commented 3 years ago

Please have a look at https://github.com/chrismattmann/tika-python/issues/313#issuecomment-723640151 for more information on how to extract structural information with Tika.

bogdankostic commented 3 years ago

I tried using Tika’s PDFMarkedContentExtractor by setting the parameter extractMarkedContent to true in Tika’s config file. Unfortunately, this returns structural information only if the PDF is tagged. Most of the PDFs I tried are not tagged, so this method doesn’t seem suitable.

An alternative would be to build a heuristic using font and font size. Setting extractFontNames to true in Tika gives us only a list of the fonts used in a document; it provides neither the position and associated text of each font nor its size. With pdftotext, I haven’t found a way to get font information. pdfminer.six seems to return both font and font size.
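A minimal sketch of such a font-size heuristic with pdfminer.six could look like the following. It is untested on real documents and rests on one assumption: a line whose dominant font size is noticeably larger than the body size is a header. The function names and the `min_ratio` threshold are hypothetical; the pdfminer.six import is deferred so that the heuristic itself can be tried on plain tuples.

```python
from collections import Counter


def iter_lines_with_font(pdf_path):
    """Yield (text, font_size, font_name) for each text line of a PDF."""
    # Imported lazily so tag_headers() works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                chars = [c for c in line if isinstance(c, LTChar)]
                text = line.get_text().strip()
                if not chars or not text:
                    continue
                # Use the most frequent size and font name within the line.
                size = Counter(round(c.size, 1) for c in chars).most_common(1)[0][0]
                font = Counter(c.fontname for c in chars).most_common(1)[0][0]
                yield text, size, font


def tag_headers(lines, min_ratio=1.15):
    """Pair every body line with the most recent header-like line.

    The body size is the font size that covers the most characters. A line
    counts as a header if its size is at least `min_ratio` times that size.
    Returns a list of (header_or_None, text) tuples for the body lines.
    """
    lines = list(lines)
    weight = Counter()
    for text, size, _font in lines:
        weight[size] += len(text)
    if not weight:
        return []
    body_size = weight.most_common(1)[0][0]

    tagged, current_header = [], None
    for text, size, _font in lines:
        if size >= body_size * min_ratio:
            current_header = text
        else:
            tagged.append((current_header, text))
    return tagged
```

Calling `tag_headers(iter_lines_with_font("some.pdf"))` would then give header/text pairs to feed into the passage splitting. Bold fonts with the same size as the body text are not detected by this sketch; the font name is returned so that such a rule could be added.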

tholor commented 3 years ago

There's a new kid on the block. Let's give pdfstructure a try:

Timoeller commented 3 years ago

Maybe we can test out https://github.com/axa-group/Parsr for getting structural information.

ZanSara commented 1 year ago

I think this was completed in https://github.com/deepset-ai/haystack/issues/3057. Let's reopen if that is not the case.