deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
16.82k stars 1.85k forks

PDFTOTEXT does not seem to be installed on GPU Docker #1094

Closed AlviseSembenico closed 3 years ago

AlviseSembenico commented 3 years ago

Describe the bug: pdftotext does not seem to work in the GPU Docker image.

Error message

  File "/home/user/rest_api/controller/router.py", line 3, in <module>
    from rest_api.controller import search, file_manager
  File "/home/user/rest_api/controller/file_manager.py", line 20, in <module>
    pdf_converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
  File "/home/user/haystack/file_converter/pdf.py", line 38, in __init__
    """
Exception: pdftotext is not installed. It is part of xpdf or poppler-utils software suite.

    Installation on Linux:
    wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz &&
    tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin

    Installation on MacOS:
    brew install xpdf

    You can find more details here: https://www.xpdfreader.com

Additional context: The problem occurs when running the following command: docker run deepset/haystack-gpu:latest

System:

Timoeller commented 3 years ago

Hey @AlviseSembenico, thanks for using our dockerized version.

usage

How are you using the Docker image? Are you interacting with the code through a Jupyter notebook, or do you just use the APIs?

install pdftotext

Since pdftotext is an optional package and is not installed, have you tried installing it yourself with the methods proposed in the error message?
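Before reinstalling, it can help to tell apart two cases: the pdftotext binary is not on the PATH at all, or it is present but cannot start (for example because a shared library is missing in the image). The following is a minimal diagnostic sketch, not Haystack's own check; the helper name diagnose_binary and the status labels are made up for illustration. It relies on the convention that a missing command or a dynamic-loader failure exits with code 127.

```python
import shutil
import subprocess


def diagnose_binary(name, version_flag="-v"):
    """Return (status, detail) for a command-line tool.

    status is "missing" if the tool is not on PATH, "broken" if it is
    found but cannot be started, and "ok" otherwise.
    NOTE: hypothetical helper for illustration, not part of Haystack.
    """
    path = shutil.which(name)
    if path is None:
        return "missing", f"{name} was not found on PATH"

    try:
        result = subprocess.run(
            [path, version_flag],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "broken", f"{path} exists but could not be run: {exc}"

    output = (result.stdout + result.stderr).strip()
    # Exit code 127 usually means the loader could not start the program,
    # e.g. "error while loading shared libraries". Other non-zero codes are
    # not treated as failures, since some tools return one for a version flag.
    if result.returncode == 127 or "error while loading shared libraries" in output:
        return "broken", output or f"{path} exited with code 127"
    return "ok", output


if __name__ == "__main__":
    status, detail = diagnose_binary("pdftotext")
    print(status)
    print(detail)
```

Running this inside the container (for example via docker exec) shows whether installing the binary is enough or whether system libraries need to be added as well.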

alternative through tika pdf conversion

We also have PDF conversion through Apache Tika with our TikaConverter:

from pathlib import Path
from haystack.file_converter.tika import TikaConverter

converter = TikaConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_pdf = converter.convert(file_path=Path("data/preprocessing_tutorial/bert.pdf"), meta=None)

Have you tried this? Note that it needs a running Tika server, which you can start with docker run -p 9998:9998 apache/tika:1.24.1
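Since the TikaConverter depends on that server, a quick reachability check before converting can make failures easier to read. This is a minimal sketch under two assumptions: the server listens on http://localhost:9998 as in the docker run command, and it answers a plain GET on the /tika endpoint. The helper name tika_is_reachable is hypothetical and not part of Haystack.

```python
import urllib.error
import urllib.request


def tika_is_reachable(base_url="http://localhost:9998", timeout=3.0):
    """Return True if an HTTP GET on <base_url>/tika succeeds.

    NOTE: hypothetical helper; the /tika endpoint and the default port
    are assumptions based on the docker run command in this thread.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/tika", timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    if tika_is_reachable():
        print("Tika server is reachable")
    else:
        print("Tika server is not reachable; start it before converting")
```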

lalitpagaria commented 3 years ago

@Timoeller But it seems that the Dockerfile for GPU does install it: https://github.com/deepset-ai/haystack/blob/master/Dockerfile-GPU#L17 So there might be an issue with building and releasing these Docker images.

Timoeller commented 3 years ago

That's a good point @lalitpagaria, pdftotext should be installed.

I tried to replicate the issue by building the CPU version, which uses the same code for installing pdftotext. Using pdftotext there works:

docker exec -it 8d5bd0d05f17 bash
root@8d5bd0d05f17:/home/user# python
Python 3.7.4 (default, Oct 17 2019, 06:18:21) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from haystack.file_converter.pdf import PDFToTextConverter
>>> converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
pdftotext version 4.03 [www.xpdfreader.com]
Copyright 1996-2021 Glyph & Cog, LLC

Edit:

I was able to replicate the issue by just executing docker run deepset/haystack-gpu:latest. @oryx1729 could you look into this please?

AlviseSembenico commented 3 years ago

Hi @Timoeller, thank you for your reply! My current usage is to start from your Docker image, add my code to it, and run the REST API. While debugging, I isolated the error in the same way you were able to replicate it.

Furthermore, I tried to install pdftotext directly in my Docker image, but with no success.

AlviseSembenico commented 3 years ago

According to this post, I added the following packages to the image, and that seems to have solved the problem.

RUN apt-get install libpoppler-cpp-dev pkg-config -y --fix-missing

Shall I do a PR?

Timoeller commented 3 years ago

Nice one, yes, we would appreciate a PR.

AlviseSembenico commented 3 years ago

Sure, here is the related PR: https://github.com/deepset-ai/haystack/pull/1107