deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Truncated File error #373

Open libgober opened 3 years ago

libgober commented 3 years ago

I am trying to extract text from hundreds of thousands of PDFs using a computer cluster. I want to run commands like

textract cl-exec-201666USCOC.pdf -o test1.txt -m tesseract

where the example PDF is from https://www.ncua.gov/files/comment-letters/2016/cl-exec-201666USCOC.pdf.
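For a batch of this size, it may help to wrap the command so that one failing PDF is recorded instead of stopping the run. The following is only a minimal sketch: it reuses the exact CLI call shown above, assumes the `textract` executable exits with a nonzero status when extraction fails, and the names `build_command` and `run_batch` are hypothetical helpers, not part of textract.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Build the same textract CLI call shown above for one PDF."""
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    return ["textract", str(pdf_path), "-o", str(out_path), "-m", "tesseract"]


def run_batch(pdf_paths, out_dir, runner=subprocess.run):
    """Run one command per PDF and return a list of (path, stderr) failures.

    `runner` defaults to subprocess.run; it is a parameter only so the
    loop can be exercised without textract installed.
    """
    failures = []
    for pdf_path in pdf_paths:
        result = runner(
            build_command(pdf_path, out_dir),
            capture_output=True,  # keep stdout/stderr for later inspection
            text=True,
        )
        if result.returncode != 0:
            failures.append((str(pdf_path), result.stderr))
    return failures
```

Calling `run_batch(sorted(Path("pdfs").glob("*.pdf")), "txt")` would then return only the files that failed, together with their stderr, so they can be retried or inspected separately. The directory names `pdfs` and `txt` are placeholders.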

When I run the textract command on this PDF, I get the following message:

The command tesseract /tmp/tmpwxgqr7lj/conv-4.ppm stdout failed with exit code 1
------------- stdout -------------
b''
------------- stderr -------------
b'Error in findFileFormatStream: truncated file\nError during processing.\n'
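The stderr above suggests that tesseract was handed a page image it considered incomplete. One way to narrow this down, assuming the intermediate images are regenerated by hand into a directory that is not cleaned up afterwards, is to look for files that are empty or implausibly small. This is a rough sketch, not a diagnosis: `find_suspect_images` is a hypothetical helper, and the 12-byte cutoff is an arbitrary assumption.

```python
from pathlib import Path

# Assumed cutoff: a file this small cannot hold a real page image.
MIN_IMAGE_BYTES = 12


def find_suspect_images(image_dir, pattern="*.ppm", min_bytes=MIN_IMAGE_BYTES):
    """Return (name, size) for files matching `pattern` smaller than `min_bytes`."""
    suspects = []
    for path in sorted(Path(image_dir).glob(pattern)):
        size = path.stat().st_size
        if size < min_bytes:
            suspects.append((path.name, size))
    return suspects
```

An empty result would point away from missing image data and toward something else, such as the files being read before they are fully written.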

To run the code, the cluster requires us to build a Docker image. Here's the Dockerfile I have.

FROM python:3.7
RUN echo 'deb http://ftp.us.debian.org/debian stretch main contrib non-free' >> /etc/apt/sources.list
RUN apt-get update -y && apt-get install -y vim python-dev pstotext \
    libxml2-dev libxslt1-dev antiword unrtf \
    poppler-utils tesseract-ocr \
    flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
RUN pip install ipython textract pandas tqdm beautifulsoup4 joblib
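Inside a container built from a Dockerfile like this one, a quick sanity check before starting a large job might be to confirm that the external tools are on the PATH and that the temporary directory has room for intermediate images. The sketch below is an assumption-laden starting point: the tool list (`tesseract`, plus `pdftoppm` from poppler-utils) and the 1 GB threshold are guesses based on the `.ppm` path in the error, and `check_environment` is a hypothetical helper.

```python
import shutil
import tempfile


def check_environment(tools=("tesseract", "pdftoppm"), min_free_bytes=1_000_000_000):
    """Report tools missing from PATH and free space in the temp directory."""
    missing = [tool for tool in tools if shutil.which(tool) is None]
    free = shutil.disk_usage(tempfile.gettempdir()).free
    return {
        "missing_tools": missing,
        "temp_free_bytes": free,
        # The threshold is an arbitrary assumption, not a textract requirement.
        "temp_space_ok": free >= min_free_bytes,
    }
```

Running `print(check_environment())` on a cluster node would show whether either condition looks off there, which can differ from what a local test of the same image shows.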

BenjaminArmijo3 commented 2 years ago

Have you already fixed it?