deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
17.96k stars 1.93k forks

PDF Support #182

Closed anirbansaha96 closed 4 years ago

anirbansaha96 commented 4 years ago

Is there any way to directly work with PDF documents? For now, every time I need to work with PDF files, I have to convert them into text files first and then use those.

import PyPDF2

# Extract the text of every page into a list
article = []
with open('data/article/doc.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    for page in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(page)
        article.append(pageObj.extractText())
print(article)

# Dump the extracted text to a file, then index it as usual
with open('data/article/doc.txt', 'w') as article_text:
    article_text.writelines(article)
write_documents_to_db(document_store=document_store, document_dir=doc_dir, clean_func=clean_wiki_text, only_empty_db=True)

But is there any pre-built support for PDF documents?

tholor commented 4 years ago

Hey @anirbansaha96 ,

Yes, there is a PDF converter within haystack that you can use:

from haystack.indexing.file_converters.pdf import PDFToTextConverter

converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
pages = converter.extract_pages(file_path=file)

It comes with some basic cleaning functions as well. (see also https://github.com/deepset-ai/haystack/blob/master/README.rst#7-indexing-pdf-files)

You could use the converter in the indexing part of your pipeline like this:

converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True)
document_store = ElasticsearchDocumentStore()

dicts = []
for file in Path("<PATH-TO_DIR-WITH-PDFs>").iterdir():
    pages = converter.extract_pages(file_path=file)
    text = "\n".join(pages)
    # optional: do more cleaning here or index single pages instead of whole docs ...
    dicts.append({"name": file.name, "text": text})

document_store.write_documents(dicts)

We also plan a feature that lets you display your search results directly in the original PDFs, but it's currently a bit further down the roadmap.

Hope this helps!

anirbansaha96 commented 4 years ago

I just wanted to clear up the following doubts: 1) When you suggest the last line document_store.write_documents(dicts), is this instead of write_documents_to_db(document_store=document_store, document_dir=doc_dir, clean_func=clean_wiki_text, only_empty_db=True), and does it achieve the same purpose?

2) Does this directly access the PDFs in the directory "<PATH-TO_DIR-WITH-PDFs>" and write them directly to Document Store so that we can proceed with the usual working with retriever accessing them from the document store?

anirbansaha96 commented 4 years ago

Also, what would this implementation look like when using InMemoryDocumentStore()?

anirbansaha96 commented 4 years ago

The solution you provided is giving the following error:

----> from haystack.indexing.file_converters.pdf import PDFToTextConverter

ModuleNotFoundError: No module named 'haystack.indexing.file_converters.pdf'
tholor commented 4 years ago

The solution you provided is giving the following error[...]

Sorry, I should have mentioned that we only added this feature recently. If you want to try this, please install from the latest master branch (via git pull && pip install -e .).

When you suggest the last line document_store.write_documents(dicts), this is instead of write_documents_to_db(document_store=document_store, document_dir=doc_dir, clean_func=clean_wiki_text, only_empty_db=True) and achieves the same purpose?

Yes, exactly. In an early version of Haystack, write_documents_to_db did both jobs: converting files to dictionaries and writing them to the DocumentStore. After some user feedback, we found it's better to split this into two separate methods that are easier to understand and customize. So in the latest Haystack version you won't find write_documents_to_db anymore, but rather two separate functions:

1) convert_files_to_dicts(): takes a file directory as input and returns Python dictionaries incl. the plain text
2) document_store.write_documents(): indexes a list of dictionaries (e.g. coming from convert_files_to_dicts()) into your DocumentStore

Example from the current Tutorial:

https://github.com/deepset-ai/haystack/blob/84a25c73b3e3d80a0dc02f97876c9ef51f4a1c95/tutorials/Tutorial1_Basic_QA_Pipeline.py#L73-L80
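In spirit, convert_files_to_dicts() just walks a directory, extracts the text of each file, and returns a list of dicts that write_documents() can index. A stdlib-only illustration of that contract for plain-text files (a toy stand-in, not the actual Haystack implementation, and without Haystack's cleaning logic):

```python
from pathlib import Path
import tempfile

def convert_text_files_to_dicts(dir_path):
    """Toy stand-in for Haystack's convert_files_to_dicts(): read each file's
    text and return the list-of-dicts shape that write_documents() expects."""
    dicts = []
    for file in sorted(Path(dir_path).iterdir()):
        dicts.append({"name": file.name, "text": file.read_text()})
    return dicts

# Build a throwaway directory with two text files to demonstrate
tmp = Path(tempfile.mkdtemp())
(tmp / "doc1.txt").write_text("hello world")
(tmp / "doc2.txt").write_text("second document")

dicts = convert_text_files_to_dicts(tmp)
print([d["name"] for d in dicts])  # -> ['doc1.txt', 'doc2.txt']
```

The real function additionally handles PDF-to-text conversion and optional cleaning, but the input/output shape is the same.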

Does this directly access the PDFs in the directory "<PATH-TO_DIR-WITH-PDFs>" and write them directly to the Document Store, so that we can proceed as usual with the retriever accessing them from the document store?

convert_files_to_dicts() will read & clean the PDFs from the directory, and document_store.write_documents() will do the actual "writing". If you want to debug / inspect the results of the conversion, it's straightforward to just print the dicts returned by convert_files_to_dicts().
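For illustration, the dicts have the minimal shape used earlier in this thread (a "name" and a "text" field; the values below are made-up placeholders), and printing a short preview per document is usually enough for a sanity check:

```python
# Minimal shape of the dicts passed to write_documents()
# (names and texts here are made-up placeholders):
dicts = [
    {"name": "doc1.pdf", "text": "First page text\nSecond page text"},
    {"name": "doc2.pdf", "text": "Short report"},
]

# Inspect before indexing: the name plus a one-line preview of the text
for d in dicts:
    preview = d["text"][:30].replace("\n", " ")
    print(f"{d['name']}: {preview}")
```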

Also, what would this implementation be for using InMemoryDocumentStore().

Just exchanging the line with the DocumentStore should work.

converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True)
document_store = InMemoryDocumentStore()

dicts = []
for file in Path("<PATH-TO_DIR-WITH-PDFs>").iterdir():
    pages = converter.extract_pages(file_path=file)
    text = "\n".join(pages)
    # optional: do more cleaning here or index single pages instead of whole docs ...
    dicts.append({"name": file.name, "text": text})

document_store.write_documents(dicts)

(Didn't test this snippet, so let me know if you face any particular issue here)

anirbansaha96 commented 4 years ago

There were a few errors I encountered; some of them I've solved and wanted to share so that you can look into them for future reference. One I'm still facing, and I'd request you to resolve it.

1) When running converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True) I got an error that the module pdftotext is not installed, which I solved by manually running !wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.02.tar.gz && tar -xvf xpdf-tools-linux-4.02.tar.gz && sudo cp xpdf-tools-linux-4.02/bin64/pdftotext /usr/local/bin

2) In the line for file in Path(doc_dir).iterdir(): I got an error Path is not defined, which I solved by adding from pathlib import Path

3) While running the line pages = converter.extract_pages(file_path=file) I'm currently getting the error __init__() got an unexpected keyword argument 'capture_output', please do look into it.
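On point 2, note that iterdir() yields every entry in the directory, not just PDFs, so filtering by suffix avoids feeding non-PDF files to the converter. A stdlib-only sketch (the directory and file names are throwaway examples):

```python
from pathlib import Path
import tempfile

# Build a throwaway directory with mixed file types to demonstrate filtering
tmp = Path(tempfile.mkdtemp())
(tmp / "a.pdf").touch()
(tmp / "b.txt").touch()
(tmp / "c.pdf").touch()

# iterdir() returns all entries; keep only files with the .pdf suffix
pdfs = sorted(p.name for p in tmp.iterdir() if p.suffix == ".pdf")
print(pdfs)  # -> ['a.pdf', 'c.pdf']
```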

tholor commented 4 years ago

Thanks for reporting these @anirbansaha96. Can you please post the full error message you got for 3)? Which Python version are you running? It seems to be related to starting the subprocess via subprocess.run(command, capture_output=True, shell=False) in _read_pdf().

anirbansaha96 commented 4 years ago

The full error message is

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-af6cda17bc13> in <module>()
      1 dicts = []
      2 for file in Path(doc_dir).iterdir():
----> 3     pages = converter.extract_pages(file_path=file)
      4     text = "\n".join(pages)
      5     # optional: do more cleaning here or index single pages instead of whole docs ...

2 frames
/usr/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    421         kwargs['stdin'] = PIPE
    422 
--> 423     with Popen(*popenargs, **kwargs) as process:
    424         try:
    425             stdout, stderr = process.communicate(input, timeout=timeout)

TypeError: __init__() got an unexpected keyword argument 'capture_output'

I'm running it in Colab using Python 3.6.9

anirbansaha96 commented 4 years ago

Also, just to see the result, I forked the repo and changed it to capture_output=False; however, as expected, this returns an empty string as output.

tholor commented 4 years ago

It seems that the capture_output arg is only available for Python >= 3.7 (https://docs.python.org/3.6/library/subprocess.html).

@tanaysoni can you please investigate a workaround for python 3.6?
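In the meantime, the portable equivalent is to pass the pipes explicitly: capture_output=True (added in 3.7) is just shorthand for stdout=PIPE, stderr=PIPE, and the explicit form also works on 3.6. A minimal sketch (the command here is a placeholder, not the actual pdftotext call):

```python
import subprocess
import sys

# On Python 3.6, subprocess.run() has no capture_output parameter;
# passing stdout/stderr pipes explicitly is the equivalent, portable form.
result = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    shell=False,
)
print(result.stdout.decode().strip())  # -> hello
```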

tanaysoni commented 4 years ago

Hi @anirbansaha96, the capture_output param is now resolved with #194. I am closing this thread, but please feel free to open a new one if you face any further issues.

1ssb commented 2 years ago

Hi @tholor, I am stuck with the same problem as @anirbansaha96. Would you mind posting the latest version of the code that serves as a complete replacement for the very first request?

vibha0411 commented 1 year ago

I am still getting this error:

haystack.errors.PipelineSchemaError: Haystack component with the name 'PDFToTextConverter' not found.

bilgeyucel commented 1 year ago

Hi @vibha0411, your issue might be related to #3201. Can you give more information about the Haystack version and how you use Haystack there? Feel free to open a new issue if it's not the same error.

vibha0411 commented 1 year ago

Hi @bilgeyucel,

Thank you for the quick response!

It's very similar to https://github.com/deepset-ai/haystack/issues/3201.

I have installed haystack and rest_api:

pip install haystack (farm-haystack @ file:///Users/vibha/workspace/haystack)
pip install rest_api (rest-api @ file:///Users/vibha/workspace/haystack/rest_api)

However, uninstalling and reinstalling does not solve it.

vibha0411 commented 1 year ago

This is how my pipelines.haystack-pipeline.yml looks

# To allow your IDE to autocomplete and validate your YAML pipelines, name them as <name of your choice>.haystack-pipeline.yml

version: ignore

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore    # params can reference other components defined in the YAML
      top_k: 5
  - name: Reader       # custom-name for the component; helpful for visualization & debugging
    type: FARMReader    # Haystack Class name for the component
    params:
      model_name_or_path: deepset/roberta-base-squad2
      context_window_size: 500
      return_no_answer: true
  - name: TextFileConverter
    type: TextConverter
  - name: PDFFileConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 1000
  - name: FileTypeClassifier
    type: FileTypeClassifier

pipelines:
  - name: query    # a sample extractive-qa Pipeline
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFFileConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [PDFFileConverter, TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]

Also

pdftotext -v                                                                             
pdftotext version 4.04 [www.xpdfreader.com]
Copyright 1996-2022 Glyph & Cog, LLC

Please do let me know if any further information is required

bilgeyucel commented 1 year ago

Hi @vibha0411, I replied to you in #3201