Timoeller closed this issue 1 year ago
Please have a look at https://github.com/chrismattmann/tika-python/issues/313#issuecomment-723640151 for more information on how to extract structural information with tika.
I tried using tika’s PDFMarkedContentExtractor by setting the parameter extractMarkedContent to true in tika’s config file. Unfortunately, this returns structural information only if the PDF is tagged. Most of the PDFs I tried are not tagged, so this method does not seem suitable.
An alternative would be to build a heuristic using font and font size. With tika, setting extractFontNames to true gives us only a list of the fonts used in a document; it provides neither the font size nor the position and text associated with each font.
Using pdftotext, I haven’t found a way to get font information.
pdfminer.six seems to return both font and font size.
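To make the font-size heuristic concrete, here is a minimal sketch of how it could look with pdfminer.six. The function names iter_lines_with_font and mark_headers, and the 1.15 size ratio, are assumptions made up for illustration; only extract_pages, the layout classes, and the fontname and size attributes of LTChar come from pdfminer.six. It treats the font size that covers the most characters as body text and flags noticeably larger lines as headers.

```python
from collections import Counter


def iter_lines_with_font(pdf_path):
    """Yield (text, font_name, font_size) for each text line in a PDF."""
    # Imported lazily so mark_headers() can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                chars = [c for c in line if isinstance(c, LTChar)]
                text = line.get_text().strip()
                if not chars or not text:
                    continue
                # Represent the line by its most common font name and size.
                font_name = Counter(c.fontname for c in chars).most_common(1)[0][0]
                font_size = Counter(round(c.size, 1) for c in chars).most_common(1)[0][0]
                yield text, font_name, font_size


def mark_headers(lines, ratio=1.15):
    """Return (text, is_header) pairs for a list of (text, font, size) tuples.

    Assumption: the size covering the most characters is the body size,
    and a line is a header if its size is at least ratio * body size.
    """
    if not lines:
        return []
    size_weights = Counter()
    for text, _font, size in lines:
        size_weights[size] += len(text)
    body_size = size_weights.most_common(1)[0][0]
    return [(text, size >= body_size * ratio) for text, _font, size in lines]
```

Usage would be something like mark_headers(list(iter_lines_with_font("some.pdf"))). This ignores bold fonts, numbering, and multi-column layouts, so it is only a starting point.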
There's a new kid on the block. Let's give pdfstructure a try.
Maybe we can test out https://github.com/axa-group/Parsr for getting structural information.
I think this was completed in https://github.com/deepset-ai/haystack/issues/3057. Let's re-open if it was not the case.
What to do
When converting PDF documents to text with either Apache Tika or pdftotext, we have some functionality to split the documents into passages afterwards. It would be beneficial to have per-passage meta information about the title of the PDF document and the (sub)header of the corresponding passage.
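As a rough sketch of what this could look like, assume header lines have already been detected somehow, so the input is a list of (text, is_header) pairs. The function name split_into_passages and the dict layout with "text" and "meta" keys are hypothetical and only meant to illustrate the idea, not an existing API.

```python
def split_into_passages(lines, title):
    """Group lines into passages, one per (sub)header section.

    lines: iterable of (text, is_header) pairs in document order.
    Returns a list of dicts carrying the document title and the most
    recent header as meta information.
    """
    passages = []
    current_header = None
    buffer = []

    def flush():
        # Emit the buffered body lines as one passage, if there are any.
        if buffer:
            passages.append(
                {
                    "text": " ".join(buffer),
                    "meta": {"title": title, "header": current_header},
                }
            )
            buffer.clear()

    for text, is_header in lines:
        if is_header:
            flush()  # close the passage that belongs to the previous header
            current_header = text
        else:
            buffer.append(text)
    flush()
    return passages
```

Text that appears before the first header gets header None, so it is still kept and still carries the title.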
Use-case
This information might be very useful for DPR-based retrieval or for more rule-based retrieval. Rule-based retrieval on (sub)header information could be relevant if we can rely on the headers containing specific keywords.
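A minimal sketch of the rule-based case, assuming each passage is a dict with a "meta" entry that holds a "header" string (a hypothetical layout, as is the function name filter_by_header_keywords): keep only the passages whose header contains at least one of the given keywords.

```python
import re


def filter_by_header_keywords(passages, keywords):
    """Return the passages whose header contains any keyword as a whole word.

    Matching is case-insensitive; passages without a header never match.
    """
    wanted = {k.lower() for k in keywords}
    hits = []
    for passage in passages:
        header = (passage.get("meta", {}).get("header") or "").lower()
        tokens = set(re.findall(r"\w+", header))
        if wanted & tokens:
            hits.append(passage)
    return hits
```

For example, filter_by_header_keywords(passages, ["pricing", "fees"]) would narrow the candidates to the matching sections before any further ranking.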