chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0
1.51k stars 235 forks

Unable to start Tika Server and get corrupt file when running tika-server.jar #371

Closed: devipramita closed this 1 year ago

devipramita commented 2 years ago

Hi, I get a Tika Server error when parsing a PDF with Tika.

To overcome this issue, I've tried the following:

Following the solution in https://github.com/chrismattmann/tika-python/issues/238#issuecomment-527315954, I tried to run java -jar <>, but it gives me another error: "Invalid or corrupt jarfile tika-server.jar"

Meanwhile, the downloaded tika-server.jar is reported as "tika-server.jar: HTML document, ASCII text, with CRLF line terminators"

Does anyone have any ideas on how to solve this?

Thank you

divyaksh-shukla commented 2 years ago

It looks like you have a bad download of Apache Tika: instead of the jar, you only have an HTML page. According to the current download CDN on the Apache Tika website, only two versions are available: 1.28.4 and 2.4.1. You can use the commands below to download Apache Tika on Linux:

# For 1.28.4
wget https://dlcdn.apache.org/tika/1.28.4/tika-server-1.28.4.jar

# For 2.4.1
wget https://dlcdn.apache.org/tika/2.4.1/tika-server-standard-2.4.1.jar

You can verify that you downloaded the correct file by running the following command and comparing your output with the output below:

$ file tika-server-1.28.4.jar 
tika-server-1.28.4.jar: Zip archive data, at least v2.0 to extract
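
If the file command is not available (for example on Windows), roughly the same check can be sketched in Python. This is only a minimal sketch: the helper name looks_like_jar is hypothetical and is not part of tika-python. It assumes a valid jar is a zip archive that begins with the standard zip local-file-header signature, which an HTML page saved under a .jar name will not have.

```python
import zipfile

# Signature of the local file header that begins a non-empty zip/jar archive.
ZIP_MAGIC = b"PK\x03\x04"


def looks_like_jar(path):
    """Return True if the file at `path` appears to be a zip/jar archive.

    Hypothetical helper for illustration; not part of tika-python.
    """
    with open(path, "rb") as fh:
        head = fh.read(4)
    # The magic-byte check rejects HTML error pages saved as .jar;
    # zipfile.is_zipfile additionally looks for the zip end-of-archive record.
    return head == ZIP_MAGIC and zipfile.is_zipfile(path)
```

Calling looks_like_jar("tika-server-1.28.4.jar") should return True for a proper download and False for an HTML page saved under that name, in which case the file needs to be downloaded again.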

chrismattmann commented 1 year ago

Correct, @divyaksh-shukla. Closing this one out.