Closed anirbansaha96 closed 4 years ago
Hey @anirbansaha96 ,
Yes, there is a PDF converter within haystack that you can use:
```python
from haystack.indexing.file_converters.pdf import PDFToTextConverter

converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de", "en"])
pages = converter.extract_pages(file_path=file)
```
It comes with some basic cleaning functions as well. (see also https://github.com/deepset-ai/haystack/blob/master/README.rst#7-indexing-pdf-files)
You could use the converter in the indexing part of your pipeline like this:
```python
from pathlib import Path

from haystack.indexing.file_converters.pdf import PDFToTextConverter
from haystack.database.elasticsearch import ElasticsearchDocumentStore  # import path may differ across haystack versions

converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True)
document_store = ElasticsearchDocumentStore()

dicts = []
for file in Path("<PATH-TO_DIR-WITH-PDFs>").iterdir():
    pages = converter.extract_pages(file_path=file)
    text = "\n".join(pages)
    # optional: do more cleaning here or index single pages instead of whole docs ...
    dicts.append({"name": file.name, "text": text})

document_store.write_documents(dicts)
```
We also plan a feature that you can later display your search results directly in the original PDFs, but it's currently a bit further down the roadmap.
Hope this helps!
I just wanted to clear up the following doubts:

1) When you suggest the last line `document_store.write_documents(dicts)`, is this instead of `write_documents_to_db(document_store=document_store, document_dir=doc_dir, clean_func=clean_wiki_text, only_empty_db=True)`, and does it achieve the same purpose?
2) Does this directly access the PDFs in the directory `"<PATH-TO_DIR-WITH-PDFs>"` and write them directly to the Document Store, so that we can proceed as usual with the retriever accessing them from the document store?

Also, what would this implementation be for using `InMemoryDocumentStore()`?
The solution you provided is giving the following error:

```
----> from haystack.indexing.file_converters.pdf import PDFToTextConverter
ModuleNotFoundError: No module named 'haystack.indexing.file_converters.pdf'
```
> The solution you provided is giving the following error[...]

Sorry, I should have mentioned that we only added this feature recently. If you want to try this, please install from the latest master branch (via `git pull && pip install -e .`).
> When you suggest the last line `document_store.write_documents(dicts)`, this is instead of `write_documents_to_db(document_store=document_store, document_dir=doc_dir, clean_func=clean_wiki_text, only_empty_db=True)` and achieves the same purpose?
Yes, exactly. In an early version of haystack, `write_documents_to_db` did both jobs: converting files to dictionaries and writing them to the DocumentStore. After some user feedback, we found that it's better to split this into two separate methods that are easier to understand and customize. So in the latest haystack version you won't find `write_documents_to_db` anymore, but rather two separate functions:

1) `convert_files_to_dicts()`: takes a file directory as input and returns Python dictionaries incl. plain text
2) `document_store.write_documents()`: indexes a list of dictionaries (e.g. coming from `convert_files_to_dicts()`) into your DocumentStore
Example from the current Tutorial:
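The tutorial example is not reproduced in this thread, so here is a minimal self-contained sketch of the two-step flow. Note this is an illustration only: `convert_files_to_dicts` is re-implemented here with plain text files (standing in for PDFs) purely to show the shape of the dictionaries that `document_store.write_documents()` expects; in haystack you would import the real function instead.

```python
import tempfile
from pathlib import Path

# Stand-in for haystack's convert_files_to_dicts(): read every file in a
# directory and return one dict per file with "name" and "text" keys.
def convert_files_to_dicts(dir_path):
    return [
        {"name": f.name, "text": f.read_text()}
        for f in sorted(Path(dir_path).iterdir())
    ]

with tempfile.TemporaryDirectory() as doc_dir:
    Path(doc_dir, "doc1.txt").write_text("First document.")
    Path(doc_dir, "doc2.txt").write_text("Second document.")
    dicts = convert_files_to_dicts(doc_dir)
    # In haystack, this list would then be indexed with:
    # document_store.write_documents(dicts)
```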
> Does this directly access the PDFs in the directory `"<PATH-TO_DIR-WITH-PDFs>"` and write them directly to Document Store so that we can proceed with the usual working with retriever accessing them from the document store?
`convert_files_to_dicts()` will read & clean the PDFs from the directory, and `document_store.write_documents()` will do the actual "writing". If you want to debug / inspect the results of the conversion, it's straightforward to just print the dicts returned by `convert_files_to_dicts()`.
> Also, what would this implementation be for using `InMemoryDocumentStore()`?
Just exchanging the line with the DocumentStore should work.
```python
from pathlib import Path

from haystack.indexing.file_converters.pdf import PDFToTextConverter
from haystack.database.memory import InMemoryDocumentStore  # import path may differ across haystack versions

converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True)
document_store = InMemoryDocumentStore()

dicts = []
for file in Path("<PATH-TO_DIR-WITH-PDFs>").iterdir():
    pages = converter.extract_pages(file_path=file)
    text = "\n".join(pages)
    # optional: do more cleaning here or index single pages instead of whole docs ...
    dicts.append({"name": file.name, "text": text})

document_store.write_documents(dicts)
```
(Didn't test this snippet, so let me know if you face any particular issue here)
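As a side note on the `# optional: ... index single pages instead of whole docs` comment in the snippet above: if you want search results to point at a specific page, you can build one dict per page instead of one per file. A sketch (the page strings here are stand-ins for the output of `converter.extract_pages(...)`):

```python
# Stand-in for the list of page strings returned by extract_pages()
pages = ["text of page one", "text of page two"]
file_name = "sample.pdf"

# One dict per page; the page number is encoded in the name so that
# retrieval results can be traced back to the right page.
dicts = [
    {"name": f"{file_name}_page_{i}", "text": page}
    for i, page in enumerate(pages, start=1)
]
# dicts can then be passed to document_store.write_documents(dicts)
```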
There were a few errors I encountered; some of them I've solved and wanted to share so that you can look into them for future purposes. One I'm still facing, and I'm requesting you to resolve it.

1) When running `converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True)` I got an error that the module `pdftotext` is not installed, which I solved by manually running:

```shell
!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.02.tar.gz && tar -xvf xpdf-tools-linux-4.02.tar.gz && sudo cp xpdf-tools-linux-4.02/bin64/pdftotext /usr/local/bin
```

2) In the line `for file in Path(doc_dir).iterdir():` it showed me the error `Path is not defined`, which I solved by running `from pathlib import Path`.

3) While running the line `pages = converter.extract_pages(file_path=file)` I'm currently getting the error `__init__() got an unexpected keyword argument 'capture_output'`; please do look into it.
Thanks for reporting these, @anirbansaha96.

Can you please post the full error message you got for 3)? Which Python version are you running? It seems to be related to starting the subprocess via `subprocess.run(command, capture_output=True, shell=False)` in `_read_pdf()`.
The full error message is:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-af6cda17bc13> in <module>()
      1 dicts = []
      2 for file in Path(doc_dir).iterdir():
----> 3     pages = converter.extract_pages(file_path=file)
      4     text = "\n".join(pages)
      5     # optional: do more cleaning here or index single pages instead of whole docs ...

2 frames
/usr/lib/python3.6/subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    421         kwargs['stdin'] = PIPE
    422
--> 423     with Popen(*popenargs, **kwargs) as process:
    424         try:
    425             stdout, stderr = process.communicate(input, timeout=timeout)

TypeError: __init__() got an unexpected keyword argument 'capture_output'
```
I'm running it in Colab using Python 3.6.9.
Also, just to see the result, I forked the repo and changed it to `capture_output=False`; however, as is obvious, it then returns an empty string as output.
It seems that the `capture_output` arg is only available for Python >= 3.7 (https://docs.python.org/3.6/library/subprocess.html).
@tanaysoni, can you please investigate a workaround for Python 3.6?
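For reference, `capture_output=True` (added in Python 3.7) is just shorthand for passing both pipes explicitly, and `subprocess.run()` forwards unknown keyword arguments to `Popen`, which is why Python 3.6 raises the `TypeError` above. A sketch of the 3.6-compatible equivalent (the echoed command here is only a stand-in for the real `pdftotext` invocation):

```python
import subprocess
import sys

# Python 3.6-compatible equivalent of
#   subprocess.run(command, capture_output=True, shell=False)
command = [sys.executable, "-c", "print('pdftotext stand-in')"]
result = subprocess.run(
    command,
    stdout=subprocess.PIPE,  # exactly what capture_output=True would set up
    stderr=subprocess.PIPE,
    shell=False,
)
output = result.stdout.decode()
```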
Hi @anirbansaha96, the `capture_output` param issue is now resolved with #194. I am closing this thread, but please feel free to open a new one if you face any further issues.
Hi @tholor, I am stuck with the same problem as @anirbansaha96. Would you mind putting up the latest version that acts as a complete replacement for his very first request?
I am still getting this error:

```
haystack.errors.PipelineSchemaError: Haystack component with the name 'PDFToTextConverter' not found.
```
Hi @vibha0411, your issue might be related to #3201. Can you give more information about the Haystack version and how you use Haystack there? Feel free to open up a new issue if it's not the same error
Hi @bilgeyucel,
Thank you for the quick response!
It's almost similar to https://github.com/deepset-ai/haystack/issues/3201.
I have installed haystack and rest_api:

```
pip install haystack    (farm-haystack @ file:///Users/vibha/workspace/haystack)
pip install rest_api    (rest-api @ file:///Users/vibha/workspace/haystack/rest_api)
```

However, uninstalling and reinstalling is not able to solve it.
This is how my `pipelines.haystack-pipeline.yml` looks:
```yaml
# To allow your IDE to autocomplete and validate your YAML pipelines, name them as <name of your choice>.haystack-pipeline.yml
version: ignore

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore    # params can reference other components defined in the YAML
      top_k: 5
  - name: Reader    # custom-name for the component; helpful for visualization & debugging
    type: FARMReader    # Haystack Class name for the component
    params:
      model_name_or_path: deepset/roberta-base-squad2
      context_window_size: 500
      return_no_answer: true
  - name: TextFileConverter
    type: TextConverter
  - name: PDFFileConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 1000
  - name: FileTypeClassifier
    type: FileTypeClassifier

pipelines:
  - name: query    # a sample extractive-qa Pipeline
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFFileConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [PDFFileConverter, TextFileConverter]
      - name: Retriever
        inputs: [Preprocessor]
      - name: DocumentStore
        inputs: [Retriever]
```
Also:

```
$ pdftotext -v
pdftotext version 4.04 [www.xpdfreader.com]
Copyright 1996-2022 Glyph & Cog, LLC
```
Please do let me know if any further information is required
Hi @vibha0411, I replied to you in #3201
Is there any way to directly work with PDF documents? For now, every time I need to work with PDF files, I have to convert them into text files and then use those. But is there any pre-built support for PDF documents?