chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Tika not working with custom jar path #228

Closed: amonaldo closed this issue 5 years ago

amonaldo commented 5 years ago

I'm working on a Python module that uses Tika, and I'm trying to use a custom jar file so that it does not get downloaded each time.

I have already placed the jar file and the md5 file inside the module:

my_module
========
      __init__.py
      package1
      package2
      package3
          __init__.py
          pdf.py
          tika-server.jar
          tika-server.jar.md5

pdf.py
====

import os
from tika import tika, parser
tika.TikaJarPath = os.path.dirname(__file__)

def get_pdf_text(path):
    parsed = parser.from_file(path)
    return parsed['content']

Tika does not start, and this is the output:

 [WARNI]  Failed to see startup log message; retrying...
 [WARNI]  Failed to see startup log message; retrying...
 [WARNI]  Failed to see startup log message; retrying...
 [ERROR]  Tika startup log message not received after 3 tries.

The problem happens when the jar file is inside the module. It works if I specify another location, but that's not an option: when I deploy the Python module, I need the module to contain the jar file.
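
One thing worth ruling out with a bundled jar is a mismatch between the jar and its `.md5` file. The sketch below assumes the `.md5` file holds the plain MD5 hex digest of the jar; `jar_matches_md5` is a hypothetical helper, not part of tika-python.

```python
import hashlib
import os


def jar_matches_md5(jar_path):
    """Return True if the jar's MD5 hex digest equals the text in <jar_path>.md5.

    Assumption: the .md5 file contains only the hex digest of the jar.
    """
    md5_path = jar_path + ".md5"
    if not (os.path.isfile(jar_path) and os.path.isfile(md5_path)):
        return False
    digest = hashlib.md5()
    with open(jar_path, "rb") as jar:
        # Read in 1 MiB chunks so a large jar is never held in memory at once.
        for chunk in iter(lambda: jar.read(1 << 20), b""):
            digest.update(chunk)
    with open(md5_path, "r") as sidecar:
        expected = sidecar.read().strip()
    return expected.lower() == digest.hexdigest()
```

Calling `jar_matches_md5(os.path.join(os.path.dirname(__file__), "tika-server.jar"))` from `pdf.py` would show whether the two bundled files agree.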

chrismattmann commented 5 years ago

hi @amonaldo what happens when you print(tika.TikaJarPath)?

amonaldo commented 5 years ago

@chrismattmann it prints the path of the module containing the jar file
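
Printing the path alone does not show whether it is absolute or whether the jar is actually there. A slightly fuller check might look like the sketch below; `describe_jar_dir` is a hypothetical helper, and the jar file name is taken from the directory listing above.

```python
import os


def describe_jar_dir(path, jar_name="tika-server.jar"):
    """Summarize what a candidate jar directory looks like on disk."""
    jar = os.path.join(path, jar_name)
    return {
        "path": path,
        "is_absolute": os.path.isabs(path),
        "is_dir": os.path.isdir(path),
        "jar_exists": os.path.isfile(jar),
        "md5_exists": os.path.isfile(jar + ".md5"),
        # Whether the current process may create files in this directory.
        "writable": os.access(path, os.W_OK),
    }
```

For example, `print(describe_jar_dir(tika.TikaJarPath))` run inside the Flask process and again in a standalone script would make any difference between the two environments visible.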

RafayGhafoor commented 5 years ago

@amonaldo, you need to pass an absolute path as the argument to dirname, like this:

os.path.join(os.getcwd(), __file__)
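
The standard library's `os.path.abspath` does much the same job (it joins a relative path onto the current working directory and normalizes the result), so an equivalent sketch could be written as follows. `module_dir` is a hypothetical helper name.

```python
import os


def module_dir(file_path):
    """Return the absolute directory that contains file_path."""
    return os.path.dirname(os.path.abspath(file_path))
```

The difference matters when `__file__` is a bare relative name: `os.path.dirname("pdf.py")` is an empty string, whereas `module_dir("pdf.py")` resolves to the current working directory.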

Moreover, you need to override three variables of the tika module (log_path, TikaJarPath, and TikaFilesPath) to make your modified script work.

Modify your pdf.py (updating the filename):

import os
from tika import tika, parser

# Absolute path of the directory containing this file (and the .jar)
abs_path = os.path.dirname(os.path.join(os.getcwd(), __file__))

# Point Tika's log, jar, and working-file paths at that directory
tika.log_path = os.getenv('TIKA_LOG_PATH', abs_path)
tika.TikaJarPath = os.getenv('TIKA_PATH', abs_path)
tika.TikaFilesPath = abs_path

def get_pdf_text(path):
    parsed = parser.from_file(path)
    return parsed['content']

if __name__ == "__main__":
    pdf_name = "TEST_FILE_NAME" # filename to test
    print(get_pdf_text(pdf_name))
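
Since the snippet above consults the `TIKA_PATH` and `TIKA_LOG_PATH` environment variables, another option is to set those variables before `tika` is imported. This is only a sketch, and it assumes the library reads them at import time; `tika_env_for` is a hypothetical helper.

```python
import os


def tika_env_for(jar_dir, env=None):
    """Set TIKA_PATH and TIKA_LOG_PATH to jar_dir unless they are already set.

    Assumption: tika reads these variables when it is first imported,
    so this has to run before `from tika import parser`.
    """
    env = os.environ if env is None else env
    jar_dir = os.path.abspath(jar_dir)
    # setdefault keeps any value the deployment environment already provides.
    env.setdefault("TIKA_PATH", jar_dir)
    env.setdefault("TIKA_LOG_PATH", jar_dir)
    return env
```

Under that assumption, calling `tika_env_for(os.path.dirname(os.path.abspath(__file__)))` at the top of `pdf.py`, ahead of the tika import, keeps the path configuration in one place.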

amonaldo commented 5 years ago

@RafayGhafoor I tried but still the same error.

RafayGhafoor commented 5 years ago

@amonaldo, have you tried restarting your computer or killing the Tika server? The server instance keeps running in the background.
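
A quick way to see whether a server from an earlier run is still listening is to try a TCP connection. The sketch below assumes the server uses port 9998 on localhost, which should be adjusted if your setup differs; `is_port_open` is a hypothetical helper.

```python
import socket


def is_port_open(host="localhost", port=9998, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If `is_port_open()` returns True before your application has started Tika, some other process (possibly a leftover Java instance) already holds the port.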

amonaldo commented 5 years ago

@RafayGhafoor I'm using Flask and I always restart the server, which causes the Java instance to be destroyed.

RafayGhafoor commented 5 years ago

@amonaldo, can you try it in a separate test module that does not require Flask? Then perhaps we can debug.

amonaldo commented 5 years ago

@RafayGhafoor it works outside Flask. I don't know why it fails when Flask is running.

RafayGhafoor commented 5 years ago

@amonaldo, perhaps the module using Flask is in a separate directory. Can you try moving the Tika-related files into the same directory and see if the same error occurs?

amonaldo commented 5 years ago

The same thing happens. I have a file called run.py that runs Flask, and even when I moved the jar file to the same directory, it still doesn't work.

RafayGhafoor commented 5 years ago

@amonaldo, Can you show me your run.py code to see how it's using Tika?

amonaldo commented 5 years ago

@RafayGhafoor this is the code

from smartcv.web import app
from waitress import serve

if __name__ == "__main__":
    try:
        serve(app, port=8080, host='0.0.0.0')
    except Exception as e:
        print(str(e))

I'm using waitress to serve the Flask app, which is defined in another module.

RafayGhafoor commented 5 years ago

@amonaldo, how and where is it using Tika?

P.S. Drop me an email, since this issue doesn't seem to be a bug in Tika.

amonaldo commented 5 years ago

@RafayGhafoor Thanks for your time, but I have found a solution, although it's not perfect.

I realized that I can get the user's home directory using the os module:

tika.TikaJarPath = os.path.expanduser("~")

This way Tika works fine and without any problem.