Closed AlviseSembenico closed 3 years ago
Hey @AlviseSembenico thanks for using our dockerized version.
How are you using the docker? Are you interacting with code through a jupyter notebook or do you just use the APIs?
Since pdf2text is an optional package and is not installed, have you tried installing it yourself with the methods proposed in the printouts?
We also have PDF conversion through Apache Tika with our TikaConverter:
from pathlib import Path
from haystack.file_converter.tika import TikaConverter
converter = TikaConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_pdf = converter.convert(file_path=Path("data/preprocessing_tutorial/bert.pdf"), meta=None)
Have you tried this? Note that it needs a running Tika server, which you can start with docker run -p 9998:9998 apache/tika:1.24.1
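Because the converter depends on that server, it can help to check that it is reachable before converting. The following is a minimal sketch and not part of Haystack: the helper name `tika_server_is_up` is hypothetical, and it assumes the server listens on localhost:9998 (as in the docker command above) and answers GET requests on a `/version` endpoint.

```python
import urllib.error
import urllib.request


def tika_server_is_up(base_url: str = "http://localhost:9998", timeout: float = 3.0) -> bool:
    """Return True if something answers HTTP 200 at <base_url>/version.

    Hypothetical helper: the default URL and the /version endpoint are
    assumptions about a default Tika server setup.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, etc.
        return False


if __name__ == "__main__":
    print("Tika reachable:", tika_server_is_up())
```

If this returns False, the conversion call would most likely fail as well, so starting the Tika container first is the thing to try.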
@Timoeller But it seems that the Dockerfile for the GPU image does install it: https://github.com/deepset-ai/haystack/blob/master/Dockerfile-GPU#L17 So there might be an issue with building and releasing these Docker images.
That's a good point, @lalitpagaria; pdf2text should be installed.
I was trying to replicate the issue by building the CPU version, which uses the same code for installing pdf2text. Using pdf2text there works:
docker exec -it 8d5bd0d05f17 bash
root@8d5bd0d05f17:/home/user# python
Python 3.7.4 (default, Oct 17 2019, 06:18:21)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from haystack.file_converter.pdf import PDFToTextConverter
>>> converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
pdftotext version 4.03 [www.xpdfreader.com]
Copyright 1996-2021 Glyph & Cog, LLC
Edit:
I was able to replicate the issue by just executing:
docker run deepset/haystack-gpu:latest
@oryx1729 could you look into this please?
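A quick way to compare images, such as the CPU and GPU ones, is to check whether a `pdftotext` executable is available inside the container. This is a minimal sketch using only the standard library. It assumes, based on the version banner in the session above, that the converter calls an executable named `pdftotext`; the helper name `pdftotext_available` is hypothetical.

```python
import shutil
import subprocess


def pdftotext_available(binary: str = "pdftotext") -> bool:
    """Return True if `binary` is found on PATH and can be started."""
    path = shutil.which(binary)
    if path is None:
        return False
    try:
        # `-v` asks for version info. The exit code is ignored because it
        # may differ between builds; only a successful launch is checked.
        subprocess.run(
            [path, "-v"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return True


if __name__ == "__main__":
    print("pdftotext available:", pdftotext_available())
```

Running this with the Python interpreter inside each container shows whether the executable is missing from one of the images.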
Hi @Timoeller, thank you for your reply! My current usage is to start from your Docker image, add my code to it, and run the REST API. During debugging, I isolated the error in the same way you were able to replicate it.
Furthermore, I tried to install pdf2text directly in my Docker image, but with no success.
Following this post, I added the following packages to the image, and the problem seems to be solved.
RUN apt-get install libpoppler-cpp-dev pkg-config -y --fix-missing
Shall I do a PR?
Nice one, yes, we would appreciate a PR.
Sure, here is the related PR: https://github.com/deepset-ai/haystack/pull/1107
Describe the bug pdftotext does not seem to be working in the GPU Docker image
Error message
Additional context The problem occurs with the following command
docker run deepset/haystack-gpu:latest
System: