Closed: lathakris closed this issue 6 years ago
BTW, I am just using the following tika-python interface to do this: parsed = unpack.from_file(file, tserver)
Sorry, here is the config file.
10:14 $ cat tika-config.xml <?xml version="1.0" encoding="UTF-8"?>
I think the reason for this is that unpack doesn't extract text. Try the parser.from_file interface :) See: https://github.com/chrismattmann/tika-python/issues/172
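For anyone trying the suggestion above, here is a minimal sketch of the switch from unpack.from_file to parser.from_file. It assumes tika-python is installed and that the second positional argument is the server endpoint, as in the unpack call earlier in the thread. The helper names (extract_text, content_or_empty) are made up for illustration and are not part of tika-python.

```python
def content_or_empty(parsed):
    """Return the extracted text from a tika-python result dict, or ""."""
    # The "content" entry can be missing or None when no text came back,
    # so fall back to an empty string before stripping whitespace.
    return (parsed.get("content") or "").strip()


def extract_text(path, server_url=None):
    """Parse `path` with Tika and return whatever text it extracted."""
    # Imported lazily so this module loads even without tika-python installed.
    from tika import parser

    if server_url is None:
        # No endpoint given: let tika-python use its default server.
        parsed = parser.from_file(path)
    else:
        # Endpoint given: point the call at an already running Tika server.
        parsed = parser.from_file(path, server_url)
    return content_or_empty(parsed)
```

An empty string from extract_text on an image means the server returned no text, which is the symptom described in this issue, so it only tells you OCR did not run, not why.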
Thank you.
It does not work even with the parser.from_file interface. I copied the English training data and set the TESSDATA_PREFIX environment variable to its path, but no luck!
You need to make sure tesseract is on your PATH and then restart the tika-server running in the background. Find it with ps aux | grep tika | grep server
and then run kill -9 <pid>
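Before restarting anything, it may help to confirm what the environment actually looks like. The sketch below only uses the Python standard library; tesseract_status is a made-up helper name. It reports where tesseract is found on PATH (or None) and what TESSDATA_PREFIX is set to. It has to run in the same environment as the Tika server (for example inside the Tika container), otherwise it says nothing about what the server can see.

```python
import os
import shutil


def tesseract_status(env=None):
    """Report the tesseract location on PATH and the TESSDATA_PREFIX value."""
    # Default to the real environment; a dict can be passed in for testing.
    env = os.environ if env is None else env
    return {
        # shutil.which returns the full path to the executable, or None.
        "tesseract": shutil.which("tesseract", path=env.get("PATH", "")),
        # None here means the variable is not set at all.
        "tessdata_prefix": env.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    print(tesseract_status())
```

If "tesseract" comes back as None, the server process will most likely not find it either, and restarting the server will not change that.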
I ran into this problem today. When running Tika and Tesseract via Docker, tesseract is not on the path, and I don't understand why it has to be.
When using requests, it just needs the Tika URL, and text extraction from images works fine.
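The requests snippet itself is not shown in the thread. As a rough sketch of the kind of call being described, the code below sends the file straight to the Tika server's PUT /tika endpoint and asks for plain text. It uses the standard library's urllib instead of requests so it runs without extra packages. The function names and the http://localhost:9998 address are assumptions for illustration.

```python
import urllib.request


def build_tika_request(tika_url, data):
    """Build (but do not send) a PUT request for the server's /tika endpoint."""
    return urllib.request.Request(
        tika_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        # Ask the server for plain text rather than its default output.
        headers={"Accept": "text/plain"},
    )


def put_to_tika(tika_url, path):
    """Upload the file at `path` and return the text the server sends back."""
    with open(path, "rb") as fh:
        req = build_tika_request(tika_url, fh.read())
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Assumed address and file name; adjust both for your own setup.
    print(put_to_tika("http://localhost:9998", "scan.png"))
```

With requests, the equivalent would be a requests.put call with the same URL, body, and Accept header. Either way, only the server needs access to tesseract; the client side just needs the URL.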
thanks @fedario
Hi,
I am running tika-server-1.18.jar within a Docker container, which I download and run using my own Dockerfile. I connect to it using the tika-python library from another container. This setup is not able to extract text from image files. I then downloaded Tesseract, installed the ‘so’ files in the Tika container, and set LD_LIBRARY_PATH etc., but the extraction still does not happen. Any idea why? (Text extraction works fine for PDFs, DOCs, etc.)
(As a debugging step, I downloaded the prebuilt Docker image and tried it out; it works fine for image file extraction. I see that they just download Tesseract in addition: https://github.com/LogicalSpark/docker-tikaserver). I do not have a tika-config file; I tried creating one in the Tika container, but it did not help. Thank you in advance for your response.