Open libgober opened 3 years ago
I am trying to extract text from hundreds of thousands of PDFs using a computer cluster. I want to run commands like
textract cl-exec-201666USCOC.pdf -o test1.txt -m tesseract
where the example PDF is from https://www.ncua.gov/files/comment-letters/2016/cl-exec-201666USCOC.pdf.
When I run this command, I get the following message:
The command tesseract /tmp/tmpwxgqr7lj/conv-4.ppm stdout failed with exit code 1
------------- stdout -------------
b''
------------- stderr -------------
b'Error in findFileFormatStream: truncated file\nError during processing.\n'
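The "truncated file" message suggests that the intermediate image (conv-4.ppm here) was incomplete when tesseract tried to read it. One way to test that idea is to compare the size of a .ppm file against what its header promises. The following is only a rough sketch: `ppm_is_complete` is a hypothetical helper, not part of textract or tesseract, and it assumes the intermediate files are binary PPM (magic number P6), which may not hold for every conversion setting.

```python
def ppm_is_complete(path):
    """Return True if a binary PPM (P6) file contains at least as much
    pixel data as its header declares. Hypothetical diagnostic helper."""
    with open(path, "rb") as f:
        data = f.read()

    # The header has four whitespace-separated tokens:
    # magic number, width, height, maximum sample value.
    tokens = []
    pos = 0
    while len(tokens) < 4:
        # Skip whitespace between tokens.
        while pos < len(data) and data[pos:pos + 1].isspace():
            pos += 1
        if pos >= len(data):
            return False  # file ended before the header was complete
        if data[pos:pos + 1] == b"#":
            # Header comments run to the end of the line.
            while pos < len(data) and data[pos:pos + 1] != b"\n":
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])

    if tokens[0] != b"P6":
        return False  # assumption: only binary PPM is handled here
    try:
        width, height, maxval = (int(t) for t in tokens[1:])
    except ValueError:
        return False

    pos += 1  # exactly one whitespace byte separates the header from pixels
    bytes_per_sample = 1 if maxval < 256 else 2
    expected = width * height * 3 * bytes_per_sample
    return len(data) - pos >= expected
```

If this returns False for a file that the PDF-to-image step just wrote, the problem is probably upstream of tesseract, for example the temporary directory filling up or the conversion being interrupted. That is a guess to verify rather than a confirmed cause.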
To run the code, the cluster requires us to build a Docker image. Here's the Dockerfile I have:
FROM python:3.7
RUN echo 'deb http://ftp.us.debian.org/debian stretch main contrib non-free' >> /etc/apt/sources.list
RUN apt-get update -y && apt-get install -y vim python-dev pstotext \
 libxml2-dev libxslt1-dev antiword unrtf \
 poppler-utils tesseract-ocr \
 flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig
RUN pip install ipython textract pandas tqdm beautifulsoup4 joblib
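With hundreds of thousands of PDFs, it may help to keep one failing file from stopping the whole run while the underlying error is investigated. Below is a minimal sketch of that idea. `extract_all` is a hypothetical helper, and the extraction function is passed in as a parameter so the sketch does not depend on any particular library.

```python
def extract_all(paths, extract):
    """Run `extract` on each path, collecting results and failures separately.

    `extract` is any callable that takes a file path and returns the
    extracted text; it is expected to raise an exception on failure.
    """
    results = {}
    failures = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:
            # Record the error message and continue with the next file.
            failures[path] = str(exc)
    return results, failures
```

Inside the container above, the callable could be something like `lambda p: textract.process(p, method="tesseract")`, assuming textract's Python API is used instead of the command line. The `failures` dictionary then gives a list of PDFs to re-examine by hand.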
Have you already fixed it?