Closed Purg closed 6 years ago
Also, either the default value for checkTikaServer port (Port) needs to be an integer (https://github.com/chrismattmann/tika-python/blob/master/tika/tika.py#L101), or the command template (https://github.com/chrismattmann/tika-python/blob/master/tika/tika.py#L351) needs to be able to take a string (%i -> %s).
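The type mismatch described here is easy to reproduce in isolation. A minimal sketch, assuming a made-up command template (not the one in tika.py) and a port value that arrives as a string, for example from an environment variable:

```python
# Hypothetical templates for illustration only; the real one lives in tika.py.
TEMPLATE_INT = "java -jar server.jar --port %i"
TEMPLATE_STR = "java -jar server.jar --port %s"

port = "9998"  # a string, e.g. read from an environment variable

try:
    TEMPLATE_INT % port
    int_template_ok = True
except TypeError:
    # %i refuses a str operand, which is the failure described in the report
    int_template_ok = False

# Either fix works: convert the value, or relax the template.
cmd_converted = TEMPLATE_INT % int(port)
cmd_relaxed = TEMPLATE_STR % port
```

Both fixes produce the same command line; converting the value has the side benefit of rejecting a non-numeric port early.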
Primary reason for this issue: system does not have java installed, e.g. a new docker instance or ubuntu/centos/etc.
Thanks @Purg will check it
hmm I like the idea of spin loop and check that a call succeeds, but it will introduce some minor but acceptable overhead. I'll implement that @Purg thanks for the report.
What about a post-install hook to check Java using setuptools.command.install in setup.py? This issue has also affected some of our clients and a pip failure might be appropriate here given the complete dependency on java.
A combination of both may be the most robust. Since java is completely detached from python and this module, java can disappear while this module sticks around in a python install tree.
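One way the Java check itself might look, whether it is called at install time or at server startup. This is only a sketch: java_is_runnable and require_java are hypothetical helper names, not part of tika-python. A setup.py could call require_java() from the run() method of a subclass of setuptools.command.install.install, but note that pip does not execute setup.py when installing from a wheel, so a runtime check is still needed.

```python
import shutil
import subprocess


def java_is_runnable(java="java"):
    """Return True if `java` resolves on PATH and `java -version` exits with 0."""
    path = shutil.which(java)
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "-version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Found on PATH but could not be executed, or it hung.
        return False
    return result.returncode == 0


def require_java(java="java"):
    """Raise RuntimeError when Java cannot be run (hypothetical helper)."""
    if not java_is_runnable(java):
        raise RuntimeError("Unable to run %r; is Java installed?" % java)
```

Because Java can be removed after the package is installed, calling the same helper at startup covers the case raised above where the module outlives the Java install.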
happy saturday, @Purg and @chrismattmann!
i took a stab at this, adding:
- TIKA_JAVA (env)
- TIKA_STARTUP_SLEEP (env)
- TIKA_STARTUP_MAX_RETRY (env)
- tika-server.log (sometimes clients would touch/chmod to create tika.log but had not done so for tika-server.log)
- a check that TIKA_JAVA can be executed with subprocess.Popen
- a check that "INFO Started Apache Tika server at..." is present in log before return True
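For readers who do not want to open the branch, the environment variables and the log check listed above could be sketched roughly as follows. This is not the code from the feature branch: the function names and the fallback values are assumptions made for illustration, and only the variable names and the log message come from the comment.

```python
import os

# Substring of the startup message quoted in the comment above.
STARTED_MARKER = "Started Apache Tika server"


def read_startup_config(environ=os.environ):
    """Collect startup settings from the environment.

    The fallback values are placeholders, not tika-python's defaults.
    """
    return {
        "java": environ.get("TIKA_JAVA", "java"),
        "sleep_seconds": float(environ.get("TIKA_STARTUP_SLEEP", "5")),
        "max_retry": int(environ.get("TIKA_STARTUP_MAX_RETRY", "3")),
    }


def log_shows_startup(log_path, marker=STARTED_MARKER):
    """Return True if the startup marker appears anywhere in the log file."""
    try:
        with open(log_path, "r", errors="replace") as handle:
            return any(marker in line for line in handle)
    except OSError:
        # A missing or unreadable log is treated as "not started yet".
        return False
```

Treating an unreadable log as "not started" keeps the caller's retry loop simple, at the cost of hiding permission problems like the tika-server.log case mentioned in the list.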
you can find it in my feature branch here: https://github.com/mjbommar/tika-python/tree/feature-check-java-exists
commits here: https://github.com/mjbommar/tika-python/commit/3ca6c2b144a54fa4531b9e048fcf8041ab2f4fb8
apologies for pycharm's aggressive reformatting, but the real changes should be apparent in the constants, startServer, and checkTikaServer.
if one of you would like to review and test, i can fix the commits and PR with just the relevant lines.
mjbommar@DESKTOP C:\Users\mjbommar\PycharmProjects\tika-python
$ set TIKA_JAVA=java11
mjbommar@DESKTOP C:\Users\mjbommar\PycharmProjects\tika-python
$ ipython
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import tika.language
In [2]: tika.language.from_buffer("This is definitely English")
2018-06-30 09:08:16,077 [MainThread ] [ERROR] Unable to run java; is it installed?
2018-06-30 09:08:16,079 [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-dce7274abac5> in <module>()
----> 1 tika.language.from_buffer("This is definitely English")
~\PycharmProjects\tika-python\tika\language.py in from_buffer(string)
35 '''
36 status, response = callServer('put', ServerEndpoint, '/language/string', string,
---> 37 {'Accept': 'text/plain'}, False)
38 return response
~\PycharmProjects\tika-python\tika\tika.py in callServer(verb, serverEndpoint, service, data, headers, verbose, tikaServerJar, httpVerbs, classpath, rawResponse)
533 global TikaClientOnly
534 if not TikaClientOnly:
--> 535 serverEndpoint = checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath)
536
537 serviceUrl = serverEndpoint + service
~\PycharmProjects\tika-python\tika\tika.py in checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath)
591 if not status:
592 log.error("Failed to receive startup confirmation from startServer.")
--> 593 raise RuntimeError("Unable to start Tika server.")
594 return serverEndpoint
595
RuntimeError: Unable to start Tika server.
@mjbommar I’d be happy to review and yes please clean up and submit your PR with only the relevant lines. We should also include a README.md update in your PR with the new env vars
Just PR'd
Insert this call in the middle, between the two imports: tika.initVM()
import tika
tika.initVM()
from tika import parser
If attempted on a system without java, or if the server failed to start for whatever reason, no error is raised until an action is attempted (checkTikaServer does not fail if the server fails to start). Also, there is a sleep in startServer. Should make some actual calls to the server to check that it's running correctly, i.e. spin-loop a request until it succeeds, or time/max-retry out.
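The spin-loop suggested here could be sketched as below. This is an illustration, not tika-python's implementation: server_responds and wait_until are hypothetical names, and the retry count and sleep interval are arbitrary defaults.

```python
import time
import urllib.request


def server_responds(url, timeout=2.0):
    """Return True only if an HTTP GET to `url` succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        # Covers URLError, HTTPError, refused connections and timeouts.
        return False


def wait_until(probe, max_retry=3, sleep_seconds=1.0):
    """Call probe() until it returns True or max_retry attempts are used up."""
    for attempt in range(1, max_retry + 1):
        if probe():
            return True
        if attempt < max_retry:
            time.sleep(sleep_seconds)
    return False
```

A caller would combine the two, for example wait_until(lambda: server_responds(endpoint)) with endpoint set to whatever URL the server is expected to answer on, and raise an error when the result is False instead of returning silently.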