chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

How to read the tika-server.jar file directly from a local or fixed server directory, including on the first run #231

Closed sunweiconfidence closed 5 years ago

sunweiconfidence commented 5 years ago

While debugging tika parsing a file, I found that on the first run it needs to download and install the server jar from a remote URL, which takes a long time. If I deploy the program to an app server with no network access, how do I deal with this? Is there a way, in code, to load the jar from a fixed directory instead? Thanks.

Harshitg10 commented 5 years ago

Set TIKA_SERVER_JAR before you import the tika package:

os.environ["TIKA_SERVER_JAR"] = << file location >>

sunweiconfidence commented 5 years ago

ok, thanks

raj5287 commented 5 years ago

@sunweiconfidence Hey, did that resolve the issue? I am doing this:

import os
os.environ["TIKA_SERVER_JAR"] = "/home/user/Downloads/tika-server-1.19.jar"
from tika import parser

raw = parser.from_file('sample.pdf')
print(raw['content'])

But it still goes online to download the file and throws an error.

sunweiconfidence commented 5 years ago

@raj5287 Yes, I resolved the issue. You need to change the code in tika's own source, in this file: C:\Program Files\Python36\Lib\site-packages\tika\tika.py

luke4u commented 4 years ago

@sunweiconfidence @raj5287 @chrismattmann I still get an error after adding this at the top of tika.py:

os.environ["TIKA_SERVER_JAR"] = r'C:\Users\Desktop\deep_pavlov\tika_server\tika-server.jar'

Not sure what I missed. Can you share more insight?

Error below:

Retrieving C:\Users\Desktop\deep_pavlov\tika_server\tika-server.jar to C:\Users\AppData\Local\Temp\tika-server.jar.

URLError: <urlopen error unknown url type: c>

chrismattmann commented 4 years ago

I think you need to add a file: prefix to your string, like this:

os.environ["TIKA_SERVER_JAR"] = r'file:C:\Users\Desktop\deep_pavlov\tika_server\tika-server.jar'
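
The error and the suggested fix can be illustrated with the standard library alone. The sketch below is not taken from tika-python: it only shows that `urllib.parse.urlparse` treats the drive letter of a bare Windows path as a URL scheme, which is consistent with the "unknown url type: c" message, and that `Path.as_uri()` is one way to build a file: URI. The helper name `jar_uri` and the relative jar location are assumptions made for illustration.

```python
import os
from pathlib import Path
from urllib.parse import urlparse


def jar_uri(jar_path):
    """Return a file: URI for a local jar path (hypothetical helper)."""
    # as_uri() requires an absolute path, so resolve() it first.
    return Path(jar_path).resolve().as_uri()


# A bare Windows path: the drive letter ends up as the parsed "scheme".
bare = r"C:\Users\Desktop\deep_pavlov\tika_server\tika-server.jar"
print(urlparse(bare).scheme)

# Assumed layout: tika-server.jar sits in the current working directory.
uri = jar_uri("tika-server.jar")
print(urlparse(uri).scheme)

# Set the variable before `import tika`, as advised earlier in the thread.
os.environ["TIKA_SERVER_JAR"] = uri
```

Whether a given tika-python version accepts the `file:///...` form produced by `as_uri()` as well as the `file:C:\...` form is worth checking on your own setup.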

sunweiconfidence commented 4 years ago

@luke4u My complete code change is below.

File changed: C:\Program Files\Python36\Lib\site-packages\tika\tika.py

Code changes:

log_path = os.getenv('TIKA_LOG_PATH', 'D:/inetpub/wwwroot/bertapi/Tika/')
log_file = os.path.join(log_path, 'tika.log')

logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s] %(message)s")
log = logging.getLogger('tika.tika')

# File logs
fileHandler = logging.FileHandler(log_file)
fileHandler.setFormatter(logFormatter)
log.addHandler(fileHandler)

# Stdout logs
consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
log.addHandler(consoleHandler)

# Log level
log.setLevel(logging.INFO)

Windows = True if platform.system() == "Windows" else False
TikaVersion = os.getenv('TIKA_VERSION', '1.19')
TikaJarPath = os.getenv('TIKA_PATH', 'D:/inetpub/wwwroot/bertapi/Tika/')
TikaFilesPath = "D:/inetpub/wwwroot/bertapi/Tika/"
TikaServerLogFilePath = log_path
TikaServerJar = os.getenv(
    'TIKA_SERVER_JAR',
    TikaFilesPath + "tika-server" + ".jar")
ServerHost = "localhost"

By the way, the directory D:/inetpub/wwwroot/bertapi/Tika/ must already contain the files tika-server.jar, tika-server.jar.md5, tika-server.log, and tika.log.
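
Before deploying to a server without network access, it may help to verify that the directory is complete. The following is a minimal sketch, not part of tika-python: `check_tika_dir` is a hypothetical helper, and it assumes the .md5 file begins with the hex MD5 digest of the jar. It checks only the jar and its .md5 file; add the log files to the list if your setup needs them present up front.

```python
import hashlib
from pathlib import Path


def check_tika_dir(tika_dir, jar_name="tika-server.jar"):
    """Return a list of problems found in tika_dir (empty if none).

    Hypothetical helper for illustration; not a tika-python API.
    """
    tika_dir = Path(tika_dir)
    jar = tika_dir / jar_name
    md5_file = tika_dir / (jar_name + ".md5")

    # Report every required file that is absent.
    problems = [f"missing: {p}" for p in (jar, md5_file) if not p.is_file()]
    if problems:
        return problems

    # Assumption: the .md5 file starts with the jar's hex MD5 digest.
    parts = md5_file.read_text().split()
    expected = parts[0].lower() if parts else ""
    actual = hashlib.md5(jar.read_bytes()).hexdigest()
    if expected != actual:
        problems.append(f"md5 mismatch for {jar}")
    return problems


if __name__ == "__main__":
    for problem in check_tika_dir("D:/inetpub/wwwroot/bertapi/Tika/"):
        print(problem)
```

An empty result means both files exist and the digest matches; any other result lists what to fix before the first run.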

luke4u commented 4 years ago

Thanks, guys. Problem solved!

ruohola commented 2 years ago

Thanks for the solutions. I wanted to keep the .jar next to my code. To recap what worked:

File structure:

.
├── lib
│   ├── tika-server.jar
│   └── tika-server.jar.md5
└── main.py

Code in main.py:

import os
from pathlib import Path

ROOT_DIR = Path(__file__).resolve(strict=True).parent

os.environ["TIKA_SERVER_JAR"] = f"file://{ROOT_DIR / 'lib/tika-server.jar'}"

import tika.parser

# Use tika.parser