chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Tika-Python on Windows: Tika server returns status 503 #206

Closed tsela closed 5 years ago

tsela commented 5 years ago

I am testing Tika-Python on my Windows 10 laptop, but I cannot get it to work. Using the following Python script (directly taken from this site, with 'path/to/file' naturally changed to a correct filepath):

"""Test Apache Tika."""

import tika
tika.initVM()
from tika import parser

parsed = parser.from_file('path/to/file')

print(parsed['metadata'])
print(parsed['content'])

I get the following:

$ python test-tika.py
2018-11-22 10:24:34,112 [MainThread  ] [WARNI]  Tika server returned status: 503
Traceback (most recent call last):
  File "test-tika.py", line 7, in <module>
    parsed = parser.from_file('C:\\Users\\Christophe.Grandsire\\Cases\\Data\\AI in FDP\\Raw Pilot data\\magnus C&C.pdf')
  File "C:\Users\Christophe.Grandsire\.virtualenvs\Tika-OOOIfOBP\lib\site-packages\tika\parser.py", line 40, in from_file
    return _parse(jsonOutput)
  File "C:\Users\Christophe.Grandsire\.virtualenvs\Tika-OOOIfOBP\lib\site-packages\tika\parser.py", line 77, in _parse
    realJson = json.loads(jsonOutput[1])
  File "c:\program files\python37\Lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "c:\program files\python37\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "c:\program files\python37\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Here are the contents of tika.log, which show that the tika-server JAR was correctly downloaded, but that every attempt to use it returns a 503 status code:

2018-11-21 15:24:59,746 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\CHRIST~1.GRA\AppData\Local\Temp\tika-server.jar.
2018-11-21 15:27:44,758 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\CHRIST~1.GRA\AppData\Local\Temp\tika-server.jar.md5.
2018-11-21 15:27:45,319 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2018-11-21 15:28:03,322 [MainThread  ] [WARNI]  Tika server returned status: 503
2018-11-21 15:35:25,203 [MainThread  ] [WARNI]  Tika server returned status: 503
2018-11-21 15:35:56,649 [MainThread  ] [WARNI]  Tika server returned status: 503
2018-11-22 10:23:32,192 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2018-11-22 10:23:55,192 [MainThread  ] [WARNI]  Tika server returned status: 503
2018-11-22 10:24:34,112 [MainThread  ] [WARNI]  Tika server returned status: 503

And here is the latest tika-server.log file:

nov. 22, 2018 10:23:33 A.M. org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

nov. 22, 2018 10:23:33 A.M. org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
nov. 22, 2018 10:23:33 A.M. org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
INFO  Starting Apache Tika 1.19 server
INFO  Setting the server's publish address to be http://0.0.0.0:9998/
INFO  Logging initialized @1629ms to org.eclipse.jetty.util.log.Slf4jLog
INFO  jetty-9.4.z-SNAPSHOT; built: 2018-06-05T18:24:03.829Z; git: d5fc0523cfa96bfebfbda19606cad384d772f04c; jvm 11.0.1+13-LTS
INFO  Started ServerConnector@63648ee9{HTTP/1.1,[http/1.1]}{0.0.0.0:9998}
INFO  Started @2108ms
WARN  Empty contextPath
INFO  Started o.e.j.s.h.ContextHandler@1536602f{/,null,AVAILABLE}
INFO  Started Apache Tika server at http://0.0.0.0:9998/

Any idea what is going on here?

chrismattmann commented 5 years ago

Not sure, @tsela. I don't have an active Windows 10 VM, only Win7 Ultimate. It looks like it can't detect that the server has started. Can you check the code in tika.py where it tries to confirm that Tika has started?

chrismattmann commented 5 years ago

never heard back so I assume this is OK now.

tsela commented 5 years ago

Sorry for never getting back to you on this, but I moved my development setup to Linux, where this problem does not occur.

jaksiprejak commented 5 years ago

Hi, I have this issue too. I'm on Windows 10 as well.

What information do you need?

LouiseDupuis commented 5 years ago

Hi, I have this problem as well. Every once in a while the server stops responding and then I have to restart the whole thing (my jupyter notebooks to be precise). If someone has any idea why this happens on Windows...

JeanB-Verr commented 4 years ago

I have the same problem on windows 10

BorisWiegand commented 4 years ago

I also have this problem on Windows 10. This time, the port is not an issue. The Tika server does not give me any log information; however, "netstat -ab" shows that it is listening on the given port. The Tika log just says: 2020-03-27 09:01:54,564 [MainThread ] [WARNI] Tika server returned status: 503

chrismattmann commented 4 years ago

@BorisWiegand can you take a look at https://cwiki.apache.org/confluence/display/TIKA/TikaOCR and try to interact with Tika server that way? Does it work? That will tell us whether the problem is in Python or in your server setup.
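
A rough way to run the same isolation test from Python is sketched below, using only the standard library. It requests a URL twice, once through whatever proxy the environment configures and once directly, and reports the status each time. The function name `probe` and the URL in the usage comment are illustrative assumptions (port 9998 is taken from the tika-server.log above); this is not part of tika-python.

```python
import urllib.error
import urllib.request


def probe(url, use_env_proxy, timeout=5):
    """Return the HTTP status for url, or a short message if nothing answered.

    use_env_proxy=True  -> honor proxy settings from the environment
    use_env_proxy=False -> connect directly, ignoring any configured proxy
    """
    # ProxyHandler(None) reads the environment; ProxyHandler({}) disables proxies.
    handler = urllib.request.ProxyHandler(None if use_env_proxy else {})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error statuses such as 503 arrive as exceptions; report the code.
        return err.code
    except (urllib.error.URLError, OSError) as err:
        return f"no response: {err}"


# Hypothetical usage, assuming the server listens on port 9998:
#   print("via environment proxy:", probe("http://localhost:9998/", True))
#   print("direct connection:    ", probe("http://localhost:9998/", False))
```

If the two results differ (for example 503 through the proxy but 200 directly), the 503 is probably coming from something between the script and the server rather than from Tika itself.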

BorisWiegand commented 4 years ago

Thank you very much for this hint. I am behind a corporate proxy and had a wrong configuration, so my Python script tried to connect to the local Tika server via the proxy. It was the proxy server, not the Tika server, that returned status code 503. I have now fixed my proxy settings and everything works as expected.
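
For anyone hitting the same thing, here is a minimal sketch of one possible fix, assuming the HTTP client in use honors the `NO_PROXY` environment variable (the `requests` library and `urllib` both do). It adds the local host names to `NO_PROXY` so that connections to a locally running server skip the proxy. The helper name `bypass_proxy_for_local_hosts` is made up for this example, and the exact settings that were changed in the case above are not known.

```python
import os


def bypass_proxy_for_local_hosts(hosts=("localhost", "127.0.0.1")):
    """Append hosts to NO_PROXY (keeping existing entries) and return the new value."""
    current = os.environ.get("NO_PROXY") or os.environ.get("no_proxy") or ""
    entries = [h.strip() for h in current.split(",") if h.strip()]
    for host in hosts:
        if host not in entries:
            entries.append(host)
    value = ",".join(entries)
    # Set both spellings, since tools differ in which one they read.
    os.environ["NO_PROXY"] = value
    os.environ["no_proxy"] = value
    return value


# Hypothetical usage: call this before making any request to the local server.
#   bypass_proxy_for_local_hosts()
#   from tika import parser
#   parsed = parser.from_file('path/to/file')
```

Setting the variable inside the script only affects that process; setting it in the system or shell environment would apply it more widely.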

chrismattmann commented 4 years ago

awesome @BorisWiegand thanks