chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Airgap Environment Setup is unable to start Tika server #390

Closed. Marcos-A closed this issue 1 year ago.

Marcos-A commented 1 year ago

Setting the TIKA_SERVER_JAR environment variable to a local file makes python-tika "download" (copy) that file to /tmp/tika-server.jar as expected, but the server then fails to start with the following errors:

2023-02-20 07:16:35,173 [MainThread  ] [INFO ]  Retrieving file:////opt/pdf2xlsx/tika_server_files/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
2023-02-20 07:16:35,180 [MainThread  ] [INFO ]  Retrieving file:////opt/pdf2xlsx/tika_server_files/tika-server-standard-2.6.0.jar.md5 to /tmp/tika-server.jar.md5.
2023-02-20 07:16:35,180 [MainThread  ] [INFO ]  Retrieving file:////opt/pdf2xlsx/tika_server_files/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
2023-02-20 07:16:35,189 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2023-02-20 07:16:40,190 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2023-02-20 07:16:45,191 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2023-02-20 07:16:50,192 [MainThread  ] [ERROR]  Tika startup log message not received after 3 tries.
2023-02-20 07:16:50,194 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.
Unable to start Tika server.

Installed in a Docker container:

Docker base image: FROM python:3.11
Kernel version: Linux afabe325295c 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 GNU/Linux
Distribution info: Debian GNU/Linux 11

Local JAR files downloaded:

Java version:

openjdk version "11.0.18" 2023-01-17
OpenJDK Runtime Environment (build 11.0.18+10-post-Debian-1deb11u1)
OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Debian-1deb11u1, mixed mode, sharing)

Setup works as expected after unsetting the TIKA_SERVER_JAR environment variable, forcing python-tika to download the same version of the JAR file from Apache.

I tried setting the JAVA_HOME environment variable to the Java installation directory, as well as the PATH variable, with no further success.
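For reference, the setup described above can be sketched in Python as follows. This is a minimal sketch, not the library's own code: `configure_local_tika_jar` is a hypothetical helper name, and it assumes that tika-python reads TIKA_SERVER_JAR when the `tika` package is imported, so the variable has to be set first. On a POSIX system the resulting URL has the same `file:////...` shape as the one in the log above.

```python
import os


def configure_local_tika_jar(jar_path):
    """Point TIKA_SERVER_JAR at a local JAR (hypothetical helper, not part of tika-python)."""
    jar = os.path.abspath(jar_path)
    if not os.path.isfile(jar):
        # Fail early instead of letting the server startup time out later.
        raise FileNotFoundError(f"Tika server JAR not found: {jar}")
    # "file:///" plus an absolute POSIX path gives "file:////opt/...",
    # the same form that appears in the log output in this issue.
    url = "file:///" + jar
    os.environ["TIKA_SERVER_JAR"] = url
    return url


# Usage (assumption: the variable must be set before tika is imported):
# configure_local_tika_jar("/opt/pdf2xlsx/tika_server_files/tika-server-standard-2.6.0.jar")
# from tika import parser
```

Checking that the file exists up front does not detect a corrupt JAR, but it rules out a wrong path as the cause of the startup failure.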

chrismattmann commented 1 year ago

What does the /tmp/tika*log file contain? I need to see what it shows in order to debug this. Thanks, @Marcos-A.

Marcos-A commented 1 year ago

Hi, @chrismattmann, thank you for your response. This is the content of the tika.log file:

2023-02-23 23:55:03,003 [MainThread  ] [INFO ]  Retrieving file:////opt/pdf2xlsx/tika_server_files/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
2023-02-23 23:55:03,016 [MainThread  ] [INFO ]  Retrieving file:////opt/pdf2xlsx/tika_server_files/tika-server-standard-2.6.0.jar.md5 to /tmp/tika-server.jar.md5.
2023-02-23 23:55:03,017 [MainThread  ] [INFO ]  Retrieving file:////opt/pdf2xlsx/tika_server_files/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
2023-02-23 23:55:03,030 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2023-02-23 23:55:08,032 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2023-02-23 23:55:13,033 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2023-02-23 23:55:18,034 [MainThread  ] [ERROR]  Tika startup log message not received after 3 tries.
2023-02-23 23:55:18,037 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.

On the other hand, this is the content of tika-server.log:

Caused by: java.lang.ClassNotFoundException: org.apache.tika.server.core.TikaServerCli

Please let me know if you need more information.

chrismattmann commented 1 year ago

Thanks, @Marcos-A. In your /tmp directory, are you seeing the corresponding tika-server-standard-2.6.0.jar file? The error in tika-server.log seems to indicate that the JAR file is somehow corrupt. Can you check the contents of the JAR file?
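One way to check the contents of the JAR is with Python's standard `zipfile` module, since a JAR is a ZIP archive. The sketch below is illustrative only: `check_jar` is a hypothetical helper, and it assumes the class named in the ClassNotFoundException is packaged directly in the JAR under its package path.

```python
import zipfile

# Entry derived from the class name in tika-server.log (assumption: the class
# is stored directly in the JAR rather than in a nested archive).
MAIN_CLASS_ENTRY = "org/apache/tika/server/core/TikaServerCli.class"


def check_jar(jar_path, required_entry=MAIN_CLASS_ENTRY):
    """Return a list of problems found in the JAR; an empty list means it looks intact."""
    if not zipfile.is_zipfile(jar_path):
        return ["not a valid ZIP/JAR archive"]
    problems = []
    with zipfile.ZipFile(jar_path) as jar:
        # testzip() reads every entry and returns the first one with a bad CRC.
        bad_entry = jar.testzip()
        if bad_entry is not None:
            problems.append(f"CRC check failed for entry: {bad_entry}")
        if required_entry not in jar.namelist():
            problems.append(f"missing entry: {required_entry}")
    return problems


# Usage:
# print(check_jar("/tmp/tika-server.jar"))
```

A truncated or partially downloaded file will usually fail the first check, because the ZIP central directory sits at the end of the archive.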

Marcos-A commented 1 year ago

Thank you so much, @chrismattmann, your suspicions were correct: the JAR file was corrupt. The strangest thing is that I had downloaded it several times before, even with different versions of Tika Server, without success. Anyway, I downloaded the file again, tested it, and it worked flawlessly. Thank you very much for your support and hard work.
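To catch a corrupt download before copying the JAR into the air-gapped environment, the file can be compared against its .md5 companion, which the log shows being retrieved alongside the JAR. This is a rough sketch under assumptions: `md5_matches` is a hypothetical helper, and the .md5 file is assumed to hold the hex digest as its first whitespace-separated token.

```python
import hashlib


def md5_matches(jar_path, md5_path):
    """Return True if the JAR's MD5 digest equals the one recorded in md5_path."""
    digest = hashlib.md5()
    with open(jar_path, "rb") as jar:
        # Read in 1 MiB chunks so a large JAR is not loaded into memory at once.
        for chunk in iter(lambda: jar.read(1 << 20), b""):
            digest.update(chunk)
    with open(md5_path, "r", encoding="utf-8") as md5_file:
        tokens = md5_file.read().split()
    if not tokens:
        # An empty checksum file cannot confirm anything.
        return False
    return digest.hexdigest() == tokens[0].lower()


# Usage:
# md5_matches("tika-server-standard-2.6.0.jar", "tika-server-standard-2.6.0.jar.md5")
```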

chrismattmann commented 1 year ago

Anytime, thank you @Marcos-A!

chrismattmann commented 1 year ago

Also, thank you so much for the tip, @Marcos-A!