deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
17.68k stars 1.92k forks

For Pdf file add title and subject in meta data #5963

Closed warichet closed 11 months ago

warichet commented 1 year ago

Hello, to make the retriever more accurate, I need to add some information to the metadata (in my case, the title and subject of the document). To do that, I propose adding this method:

    # Add title and subject to the metadata
    def _add_meta(self, doc, meta: dict):
        title = doc.metadata.get("title", "")
        if title != "":
            meta['document.title'] = title
        subject = doc.metadata.get("subject", "")
        if subject != "":
            meta['document.subject'] = subject

to the PDFToTextConverter class.

That method could be called in

    def _get_text_parallel(self, page_mp):
        idx, filename, meta, parts, sort_by_position, ocr, ocr_language = page_mp

        doc = fitz.open(filename)
        self._add_meta(doc, meta)
        .......
        return text

and in

    def _read_pdf(
        ...............
        doc = fitz.open(file_path)
        self._add_meta(doc, meta)
        ................

With this change, this information could be used by other nodes in the pipeline. Best regards, Sebastien

Timoeller commented 1 year ago

Hey, thanks for bringing this up.

@ZanSara we actually said we wanted metadata extraction in our PDFToTextDocument (v2), but checking the related PR, the functionality is not there. Do we still want to add this?

vblagoje commented 1 year ago

Hey all, I ended up here by following our discussions. What I can do, @Timoeller, is polish up the PyPDFToDocument component a bit and allow a functional hook to be added to it. This hook would accept a PdfReader and return a Document. The default installed converter would simply do what we currently do:

    text = "".join(extracted_text for page in pdf_reader.pages if (extracted_text := page.extract_text()))

That way users could easily customize their Document creation and attach to doc metadata whatever they need to.
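To make the hook idea concrete, here is a minimal sketch of what such a converter function might look like. It is an illustration under assumptions, not the component's actual API: the `Document` dataclass is a stand-in for the framework's class, and `default_converter`, `converter_with_metadata`, and `convert` are hypothetical names. The reader is duck-typed; it only needs `.pages` (each with `extract_text()`) and `.metadata` (with `title` and `subject` attributes), which is the shape pypdf's `PdfReader` exposes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Document:
    """Stand-in for the framework's Document class (hypothetical, illustration only)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def default_converter(pdf_reader: Any) -> Document:
    """Join the text of all pages, skipping pages with no extractable text."""
    text = "".join(
        extracted_text
        for page in pdf_reader.pages
        if (extracted_text := page.extract_text())
    )
    return Document(content=text)


def converter_with_metadata(pdf_reader: Any) -> Document:
    """Like the default, but also copy title and subject into meta when present."""
    doc = default_converter(pdf_reader)
    info = pdf_reader.metadata  # may be None when the PDF carries no metadata
    for key in ("title", "subject"):
        value = getattr(info, key, None)
        if value:
            # Key names follow the ones proposed earlier in this issue.
            doc.meta[f"document.{key}"] = value
    return doc


def convert(pdf_reader: Any, hook: Callable[[Any], Document] = default_converter) -> Document:
    """Hypothetical entry point: delegate Document creation to the supplied hook."""
    return hook(pdf_reader)
```

With pypdf installed, this could be called as `convert(PdfReader("file.pdf"), hook=converter_with_metadata)`; empty or missing title and subject values are simply left out of `meta`.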

warichet commented 11 months ago

OK, thanks a lot.

That's a better solution, and more flexible. Perhaps the title could be added in the default method.

Best regards

masci commented 11 months ago

This is done in 2.0.0-beta1