Haystack component with the name PDFToTextConverter not found

nickchomey commented 2 years ago

Describe the bug

I want to run the demo site without Docker.

I installed Haystack with

git clone https://github.com/deepset-ai/haystack.git
cd haystack
pip install -e .[all-gpu]

Then I tried to run the REST API server without Docker, as per your documentation:

gunicorn rest_api.application:app -b 0.0.0.0:8000 -k uvicorn.workers.UvicornWorker -t 300

But I get the following error

Error message

(venv) nick@DESKTOP-DIFRTR1:~/test/haystack$ gunicorn rest_api.application:app -b 0.0.0.0 -k uvicorn.workers.UvicornWorker --workers 1 --timeout 180
[2022-09-12 06:45:36 -0600] [1361] [INFO] Starting gunicorn 20.1.0
[2022-09-12 06:45:36 -0600] [1361] [INFO] Listening at: http://0.0.0.0:8000 (1361)
[2022-09-12 06:45:36 -0600] [1361] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2022-09-12 06:45:36 -0600] [1363] [INFO] Booting worker with pid: 1363
[2022-09-12 06:45:39 -0600] [1363] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/home/nick/test/venv/lib/python3.8/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
    worker.init_process()
  File "/home/nick/test/venv/lib/python3.8/site-packages/uvicorn/workers.py", line 66, in init_process
    super(UvicornWorker, self).init_process()
  File "/home/nick/test/venv/lib/python3.8/site-packages/gunicorn/workers/base.py", line 134, in init_process
    self.load_wsgi()
  File "/home/nick/test/venv/lib/python3.8/site-packages/gunicorn/workers/base.py", line 146, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/home/nick/test/venv/lib/python3.8/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/home/nick/test/venv/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
    return self.load_wsgiapp()
  File "/home/nick/test/venv/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/home/nick/test/venv/lib/python3.8/site-packages/gunicorn/util.py", line 359, in import_app
    mod = importlib.import_module(module)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/nick/test/venv/lib/python3.8/site-packages/rest_api/application.py", line 13, in <module>
    app = get_app()
  File "/home/nick/test/venv/lib/python3.8/site-packages/rest_api/utils.py", line 28, in get_app
    from rest_api.controller import file_upload, search, feedback, document, health
  File "/home/nick/test/venv/lib/python3.8/site-packages/rest_api/controller/file_upload.py", line 20, in <module>
    indexing_pipeline: Pipeline = get_pipelines().get("indexing_pipeline", None)
  File "/home/nick/test/venv/lib/python3.8/site-packages/rest_api/utils.py", line 59, in get_pipelines
    pipelines = setup_pipelines()
  File "/home/nick/test/venv/lib/python3.8/site-packages/rest_api/pipeline/__init__.py", line 28, in setup_pipelines
    query_pipeline = Pipeline.load_from_yaml(Path(config.PIPELINE_YAML_PATH), pipeline_name=config.QUERY_PIPELINE_NAME)
  File "/home/nick/test/haystack/haystack/pipelines/base.py", line 1836, in load_from_yaml
    return cls.load_from_config(
  File "/home/nick/test/haystack/haystack/pipelines/base.py", line 1897, in load_from_config
    validate_config(pipeline_config, strict_version_check=strict_version_check)
  File "/home/nick/test/haystack/haystack/pipelines/config.py", line 247, in validate_config
    validate_pipeline_graph(pipeline_definition=pipeline_definition, component_definitions=component_definitions)
  File "/home/nick/test/haystack/haystack/pipelines/config.py", line 359, in validate_pipeline_graph
    graph = _add_node_to_pipeline_graph(graph=graph, node=node, components=component_definitions)
  File "/home/nick/test/haystack/haystack/pipelines/config.py", line 414, in _add_node_to_pipeline_graph
    node_class = _get_defined_node_class(node_name=node["name"], components=components)
  File "/home/nick/test/haystack/haystack/pipelines/config.py", line 537, in _get_defined_node_class
    node_class = BaseComponent.get_subclass(node_type)
  File "/home/nick/test/haystack/haystack/nodes/base.py", line 126, in get_subclass
    raise PipelineSchemaError(f"Haystack component with the name '{component_type}' not found.")
haystack.errors.PipelineSchemaError: Haystack component with the name 'PDFToTextConverter' not found.
[2022-09-12 06:45:39 -0600] [1363] [INFO] Worker exiting (pid: 1363)
[2022-09-12 06:45:40 -0600] [1361] [INFO] Shutting down: Master
[2022-09-12 06:45:40 -0600] [1361] [INFO] Reason: Worker failed to boot.

I then installed xpdf using the command from the Dockerfile (is this necessary? It isn't shown in your documentation)

wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz && \
    tar -xvf xpdf-tools-linux-4.04.tar.gz && cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin

and confirmed that it is available with

(venv) nick@DESKTOP-DIFRTR1:~/test/haystack$ pdftotext -v
pdftotext version 4.04 [www.xpdfreader.com]
Copyright 1996-2022 Glyph & Cog, LLC

System:

OS: Windows 11 WSL2, using Ubuntu 20.04
GPU/CPU: Ryzen 4600H, Geforce GTX 1650
Haystack version (commit or version number): 96bb9b5 (cloned Sept 11)
DocumentStore: Elasticsearch
Reader: whatever is default in the demo
Retriever: whatever is default in the demo

nickchomey commented 2 years ago

I'm going to close this for now, as the problem seems to have gone away. My best guess is that I was accidentally using the wrong Python interpreter. That would explain why pdftotext -v worked in the CLI but not while running the Python application.
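One way to test the interpreter theory is to print which interpreter is actually running and whether it can see pdftotext on its PATH. This is only a minimal sketch using the standard library; describe_environment is a made-up helper name, not something Haystack provides.

```python
import shutil
import sys


def describe_environment(tool="pdftotext"):
    """Report the running interpreter and where (if anywhere) `tool` is on PATH."""
    return {
        "interpreter": sys.executable,    # path of the Python binary in use
        "prefix": sys.prefix,             # root of the active (virtual) environment
        "tool_path": shutil.which(tool),  # None if the tool is not on PATH
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running this with the same interpreter that gunicorn uses shows whether the venv and the shell agree about the environment.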

nickchomey commented 2 years ago

Never mind. I don't think it was the interpreter. I think what fixed it was that I had removed the PDFToTextConverter stuff from /rest_api/pipeline/pipelines.haystack-pipeline.yml, forgotten about it, and then reinstalled with pip install rest_api/...

I just uninstalled rest_api and reinstalled it, after having re-cloned the repo (and restoring the yml), and I get this error again.
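Since rest_api is installed as its own package (the traceback above loads it from site-packages, while haystack comes from the clone), it may help to print where each one is imported from. A rough sketch with the standard library only; module_origin is a hypothetical helper name.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# The two packages involved in the traceback.
for name in ("haystack", "rest_api"):
    print(name, "->", module_origin(name))
```

If rest_api resolves to a path under site-packages, edits to the YAML file in the clone will not take effect until the package is reinstalled.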

nickchomey commented 1 year ago

PDFToTextConverter seems to be working for me now. I really don't know what I changed... Perhaps some venv stuff...

bilgeyucel commented 1 year ago

Hi @vibha0411, let me follow up your message on #182 here as they seem related.

I could only reproduce the bug by deleting the PDFToTextConverter class in the pdf.py file. Have you made any similar changes? Also, can you share information on your OS and Haystack version?

vibha0411 commented 1 year ago

My OS is macOS Monterey. I installed Haystack from the repo (main branch).

The only major change I have made is in /haystack/document_stores/elasticsearch.py, where I am connecting to an Elastic Cloud instance instead of localhost:9200.

vibha0411 commented 1 year ago

I also reverted all the changes and still I get the error :(

bilgeyucel commented 1 year ago

I tried to reproduce your error and I could. For me, I get the error only in my miniconda3 environments. My miniforge environments work just fine. I am not sure if this is the real issue tbh, as I have limited knowledge here. I'll keep you updated 👍

vibha0411 commented 1 year ago

Yes I am using a miniconda3 environment as well... Thanks! Please keep me updated

bilgeyucel commented 1 year ago

Hi @vibha0411, updates here! Apparently, the problem is not about miniconda3 vs miniforge. Sorry for misguiding you there 😞

I realized that when we install Haystack with pip install haystack/, not all of the packages PDFToTextConverter needs are installed. The pdf2image and pytesseract packages have to be installed additionally. They belong to the OCR dependency option listed under custom installation. To install them, you can either run pip install -e '.[ocr]' in the haystack folder or pip install -e haystack/'.[ocr]' from one directory level above.
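To check this before reinstalling, you can ask Python whether the two packages are importable in the active environment. This is just a small sketch with the standard library; missing_modules is a made-up helper, not part of Haystack.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# pdf2image and pytesseract are the two packages named above.
print(missing_modules(["pdf2image", "pytesseract"]))
```

An empty list means both packages are present in that environment.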

vibha0411 commented 1 year ago

Thanks a loooot @bilgeyucel it finally worked!!!!!

Haystack component with the name PDFToTextConverter not found #3201