cseas / ocr-table

Extract tables from scanned image PDFs using Optical Character Recognition.
MIT License

cannot remove './temp.tiff': No such file or directory #2

Closed zexa closed 4 years ago

zexa commented 5 years ago

Hello,

I'm running into the following issue when trying to load my own file:

$ pipenv run python3 shellocr.py
Attempting pdftotext extraction...extracted 0 words.
Attempting OCR extraction...rm: cannot remove './temp.tiff': No such file or directory
extracted 0 words.

Additionally, trying to parse the same file with pdfminer returns the following:

$ python3 pdf_miner.py
b'\x0c'
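
A note on that output: b'\x0c' is a single form feed character, which pdfminer writes as a page separator. Getting only that back suggests the PDF has no text layer, which would be consistent with a scanned document. A minimal sketch for detecting this case, assuming the extracted text is available as bytes; the helper name has_text_layer is hypothetical and not part of ocr-table:

```python
def has_text_layer(extracted: bytes) -> bool:
    # bytes.strip() with no argument removes ASCII whitespace,
    # and the form feed byte (0x0c) counts as ASCII whitespace.
    return len(extracted.strip()) > 0
```

If this returns False for pdfminer's output, OCR is the only remaining route to the page contents.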

zexa commented 5 years ago

Tried running the conversion manually:

$ convert -density 300 input.pdf -depth 8 -strip -background white -alpha off ./temp.tiff
convert: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/408.
convert: no images defined `./temp.tiff' @ error/convert.c/ConvertImageCommand/3273.
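
This also explains the rm error from the first comment: convert failed, so ./temp.tiff was never created and there was nothing to remove. Below is a sketch of running the same convert command from Python and stopping with the real error message instead of carrying on. It assumes ImageMagick's convert is on PATH. The function name pdf_to_tiff and the convert_cmd parameter are hypothetical, and this is not necessarily how shellocr.py invokes the conversion:

```python
import os
import subprocess


def pdf_to_tiff(pdf_path, tiff_path="./temp.tiff", convert_cmd=("convert",)):
    # Same arguments as the manual command in this comment.
    cmd = [
        *convert_cmd,
        "-density", "300", pdf_path,
        "-depth", "8", "-strip",
        "-background", "white", "-alpha", "off",
        tiff_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Treat a non-zero exit code or a missing output file as a failure,
    # and surface whatever convert wrote to stderr.
    if result.returncode != 0 or not os.path.exists(tiff_path):
        raise RuntimeError(f"convert failed: {result.stderr.strip()}")
    return tiff_path
```

With a check like this, the security policy message would appear at the conversion step rather than as a confusing rm failure afterwards.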

Searching for the error led me to https://stackoverflow.com/questions/52998331/imagemagick-security-policy-pdf-blocking-conversion

I added
<policy domain="coder" rights="read | write" pattern="PDF" />
just before </policymap> in /etc/ImageMagick-7/policy.xml, and that makes the conversion work again. I'm not sure about the security implications of that change, though.
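
To see which coder policies a policy.xml declares for PDF, before or after editing it, the file can be inspected with Python's standard library. A minimal sketch, assuming the file follows the <policymap>/<policy> layout quoted above; the function name pdf_coder_rights is hypothetical:

```python
import xml.etree.ElementTree as ET


def pdf_coder_rights(policy_xml):
    """Return the rights of every coder policy whose pattern mentions PDF,
    in the order the entries appear in the file."""
    root = ET.fromstring(policy_xml)
    return [
        policy.get("rights", "")
        for policy in root.iter("policy")
        if policy.get("domain") == "coder" and "PDF" in policy.get("pattern", "")
    ]
```

Passing in the text of /etc/ImageMagick-7/policy.xml would list the rights strings of any PDF entries, which makes it easier to spot a leftover restrictive entry alongside the newly added line.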

Hopefully this helps somebody.

edzob commented 4 years ago

I can confirm this solution. I used the following resource and changed the policy value for PDF: https://alexvanderbist.com/posts/2018/fixing-imagick-error-unauthorized