chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

python tika with TESSERACT for OCR #189

Closed lathakris closed 6 years ago

lathakris commented 6 years ago

Hi,

I am running tika-server-1.18.jar inside a Docker container, which I build and run from my own Dockerfile. I connect to it with the tika-python library from another container. Text extraction works fine for PDFs, DOCs, etc., but it cannot extract text from image files. I then downloaded Tesseract, installed its 'so' files in the Tika container, and set LD_LIBRARY_PATH etc., but the extraction still does not happen. Any idea why?

(As a debugging step, I downloaded a prebuilt Docker image and tried it out; it works fine for image file extraction. As far as I can see, it just downloads Tesseract in addition: https://github.com/LogicalSpark/docker-tikaserver.) I did not have a tika-config file, so I tried creating one in the Tika container, but that did not help. Thank you in advance for your response.

10:14 $ cat tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>

true true false true
lathakris commented 6 years ago

BTW, I am just using the following tika-python interface to do this: parsed = unpack.from_file(file, tserver)

Sorry, here is the config file.

10:14 $ cat tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>

true true false true
chrismattmann commented 6 years ago

I think the reason for this is that unpack doesn't extract text. Try the parser.from_file interface :) See: https://github.com/chrismattmann/tika-python/issues/172
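For reference, a minimal sketch of the `parser.from_file` call suggested here, pointed at a Tika server that is already running in another container. The helper name `extract_text` and the `http://tika:9998` endpoint are assumptions for illustration, not something from this thread. `parser.from_file` returns a dict whose `content` entry holds the extracted text, or `None` when nothing was extracted.

```python
def extract_text(path, server_url):
    """Return the text Tika extracts from `path`, or "" if there is none.

    `extract_text` is a hypothetical helper; `server_url` is the base URL of an
    already running tika-server, e.g. "http://tika:9998" (assumed host name).
    """
    # Imported lazily so this module loads even where tika-python is absent.
    from tika import parser

    parsed = parser.from_file(path, serverEndpoint=server_url)
    # "content" is None when Tika found no text, e.g. when OCR did not run.
    return (parsed.get("content") or "").strip()
```

Calling `extract_text("scan.png", "http://tika:9998")` on an image should return an empty string if the server-side OCR is not working, which makes the failure easy to spot.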

lathakris commented 6 years ago

Thank you.

It does not work even with the 'parser.from_file' interface. I copied the training data for English and set the TESSDATA_PREFIX environment variable to its path, no luck!!
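One way to double-check the training data setup described above is a small script run inside the container where Tika (and therefore Tesseract) runs. This is only a sketch: `find_traineddata` is a hypothetical helper, and it assumes the English data file is named `eng.traineddata`. Depending on the Tesseract version, TESSDATA_PREFIX is expected to point either at the `tessdata` directory or at its parent, so the sketch looks in both places.

```python
import os


def find_traineddata(lang="eng", env=None):
    """Return the path of `<lang>.traineddata` under TESSDATA_PREFIX, or None."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return None
    name = lang + ".traineddata"
    # TESSDATA_PREFIX may be the tessdata directory itself or its parent,
    # depending on the Tesseract version, so check both locations.
    candidates = (
        os.path.join(prefix, name),
        os.path.join(prefix, "tessdata", name),
    )
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```

A `None` result means either that the variable is not visible to the process or that the file is not where Tesseract would look for it.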

chrismattmann commented 6 years ago

You need to make sure tesseract is on your path and then restart tika-server in the background: run ps aux | grep tika | grep server to find the process, then kill -9 <pid>.
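The path check can be sketched in a few lines of Python. Note that what matters is the PATH seen by the process that runs tika-server, so with a Dockerized server this would need to run inside the Tika container. `tesseract_version` is a hypothetical helper name; it relies only on `shutil.which` and on the `tesseract --version` command.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line of `<executable> --version`, or None if not on PATH."""
    path = shutil.which(executable)
    if path is None:
        # Not on PATH: Tika would not be able to launch it either.
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some Tesseract releases print the version to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""
```

If this returns `None` in the environment that launches tika-server, fix the PATH first and then restart the server as described.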

drrmmng commented 4 years ago

I ran into this problem today. When Tika and Tesseract run in Docker, tesseract is not on the local path, and I don't understand why it has to be. When using requests, only the Tika URL is needed and text extraction from images works fine.
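The direct-HTTP approach mentioned here can be sketched as follows. This version uses the standard library's `urllib` instead of `requests` to stay dependency-free, and the helper names are hypothetical. It sends the file with a PUT request to the server's `/tika` endpoint and asks for plain text, so any OCR happens entirely on the server side and the client only needs the URL.

```python
import urllib.request


def build_tika_request(server_url, data):
    """Build a PUT request for the Tika server's /tika endpoint."""
    return urllib.request.Request(
        server_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text_over_http(server_url, path, timeout=60):
    """Send the file at `path` to a running Tika server and return its text."""
    with open(path, "rb") as f:
        request = build_tika_request(server_url, f.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `extract_text_over_http("http://localhost:9998", "scan.png")` would return the recognized text, assuming a Tika server with working Tesseract support is listening at that (assumed) address.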

chrismattmann commented 4 years ago

thanks @fedario