Closed warichet closed 11 months ago
Hey, thanks for bringing this up.
@ZanSara we actually said we wanted metadata extraction in our PDFToTextDocument (v2), but checking the related PR, the functionality is not there. Do we still want to add this?
Hey all, I ended up here by following our discussions. What I can do, @Timoeller, is polish up the PyPDFToDocument component a bit and allow function hooks to be added to it. Such a hook would accept a PdfReader and return a Document. The default converter would simply do what we currently do:
text = "".join(extracted_text for page in pdf_reader.pages if (extracted_text := page.extract_text()))
That way, users could easily customize their Document creation and attach whatever they need to the Document metadata.
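To make the hook idea concrete, here is a rough sketch of how it might look. This is not the actual Haystack implementation: the `Document` dataclass, the `PdfConverter` class, and the `converter_with_metadata` function are hypothetical names used only for illustration. The hook takes any reader object that behaves like pypdf's `PdfReader` (a `pages` list whose items have `extract_text()`, plus a `metadata` object with `title` and `subject` attributes) and returns a Document.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Document:
    # Hypothetical stand-in for the Haystack Document class.
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def default_converter(pdf_reader: Any) -> Document:
    # Default behaviour from the comment above: join the text of all
    # pages, skipping pages where extract_text() returns nothing.
    text = "".join(
        extracted_text
        for page in pdf_reader.pages
        if (extracted_text := page.extract_text())
    )
    return Document(content=text)


def converter_with_metadata(pdf_reader: Any) -> Document:
    # Example custom hook (hypothetical): reuse the default text
    # extraction, then copy title and subject into the Document
    # metadata when the PDF provides them.
    doc = default_converter(pdf_reader)
    info = getattr(pdf_reader, "metadata", None)
    if info is not None:
        for key in ("title", "subject"):
            value = getattr(info, key, None)
            if value:
                doc.meta[key] = value
    return doc


class PdfConverter:
    # Hypothetical component illustrating the hook; it is not the
    # real PyPDFToDocument and its signature is an assumption.
    def __init__(self, hook: Callable[[Any], Document] = default_converter):
        self.hook = hook

    def run(self, readers: List[Any]) -> List[Document]:
        # Apply the configured hook to every reader.
        return [self.hook(reader) for reader in readers]
```

With something along these lines, a user who needs the title and subject would pass `hook=converter_with_metadata`, while everyone else keeps the default text-only behaviour.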
OK, thanks a lot.
That's a better, more flexible solution. Perhaps the title could be added in the default method.
Best regards
This is done in 2.0.0-beta1
Hello, to get a more accurate retriever, I need to add some information to the metadata (in my case, the title and subject of the document). To do that, I propose adding the method:
in the PDFToTextConverter class.
That method could be called in
and in
With this change, this information could be used by other nodes in the pipeline. Best regards, Sebastien