Closed sunweiconfidence closed 5 years ago
Set TIKA_SERVER_JAR before you import the tika package:
os.environ["TIKA_SERVER_JAR"] = << file location >>
ok, thanks
@sunweiconfidence hey, did that resolve the issue? I am doing this:
import os
os.environ["TIKA_SERVER_JAR"] = "/home/user/Downloads/tika-server-1.19.jar"
from tika import parser
raw = parser.from_file('sample.pdf')
print(raw['content'])
but it still goes online to download the file and throws an error.
@raj5287 yes, I resolved the issue. You need to change the code in tika's own source, in this file: C:\Program Files\Python36\Lib\site-packages\tika\tika.py
@sunweiconfidence @raj5287 @chrismattmann, I still get an error after adding this at the top of tika.py:
os.environ["TIKA_SERVER_JAR"] = r'C:\Users\Desktop\deep_pavlov\tika_server\tika-server.jar'
Not sure what I missed; can you share more insights?
Error below:
Retrieving C:\Users\Desktop\deep_pavlov\tika_server\tika-server.jar to C:\Users\AppData\Local\Temp\tika-server.jar.
URLError: <urlopen error unknown url type: c>
I think you need to add file: in front of your string, like this:
os.environ["TIKA_SERVER_JAR"] = r'file:C:\Users\Desktop\deep_pavlov\tika_server\tika-server.jar'
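The "unknown url type: c" message suggests that the drive letter of the bare Windows path was read as a URL scheme. A minimal sketch of that effect using only the standard library follows; that the tika client parses the value this way internally is an assumption, since the thread only shows the resulting urlopen error.

```python
from urllib.parse import urlparse

# The bare Windows path from the post above, and the same path with "file:" in front.
windows_path = r'C:\Users\Desktop\deep_pavlov\tika_server\tika-server.jar'
prefixed_path = 'file:' + windows_path

# urlparse treats everything before the first ":" as the scheme when it
# looks like one, so the drive letter "C" becomes the scheme "c".
bare_scheme = urlparse(windows_path).scheme
prefixed_scheme = urlparse(prefixed_path).scheme

print(bare_scheme)      # c
print(prefixed_scheme)  # file
```

With the file: prefix the scheme is one that urlopen knows how to handle, which is consistent with the fix suggested above.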
@luke4u my complete code change is below.
Code change path: C:\Program Files\Python36\Lib\site-packages\tika\tika.py
Code change points (the changed values were shown in bold):
log_path = os.getenv('TIKA_LOG_PATH', 'D:/inetpub/wwwroot/bertapi/Tika/')
log_file = os.path.join(log_path, 'tika.log')
logFormatter = logging.Formatter("%(asctime)s [%(threadName)-12.12s] [%(levelname)-5.5s] %(message)s")
log = logging.getLogger('tika.tika')
fileHandler = logging.FileHandler(log_file)
fileHandler.setFormatter(logFormatter)
log.addHandler(fileHandler)
consoleHandler = logging.StreamHandler()
consoleHandler.setFormatter(logFormatter)
log.addHandler(consoleHandler)
log.setLevel(logging.INFO)
Windows = True if platform.system() == "Windows" else False
TikaVersion = os.getenv('TIKA_VERSION', '1.19')
TikaJarPath = os.getenv('TIKA_PATH', 'D:/inetpub/wwwroot/bertapi/Tika/')
TikaFilesPath = "D:/inetpub/wwwroot/bertapi/Tika/"
TikaServerLogFilePath = log_path
TikaServerJar = os.getenv('TIKA_SERVER_JAR', TikaFilesPath + "tika-server" + ".jar")
ServerHost = "localhost"
By the way, the directory D:/inetpub/wwwroot/bertapi/Tika/ needs to contain tika-server.jar, tika-server.jar.md5, tika-server.log, and tika.log in advance.
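If only the jar was copied to the machine, the .md5 file next to it can be produced locally. The sketch below is one way to do that, assuming the file is expected to hold the plain hexadecimal MD5 digest of the jar (as the .md5 files published alongside Maven artifacts do); the thread does not confirm what tika checks, and the helper names md5_hex and write_md5_sidecar are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_hex(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_sidecar(jar_path: Path) -> Path:
    """Write '<jar name>.md5' next to the jar (hypothetical helper, assumed format)."""
    jar_path = Path(jar_path)
    sidecar = jar_path.with_name(jar_path.name + ".md5")
    sidecar.write_text(md5_hex(jar_path))
    return sidecar
```

For example, write_md5_sidecar(Path("D:/inetpub/wwwroot/bertapi/Tika/tika-server.jar")) would create tika-server.jar.md5 in the same directory.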
Thanks guys. problem solved!
Thanks for the solutions. I wanted to keep the .jar next to my code. To recap what worked:
File structure:
.
├── lib
│ ├── tika-server.jar
│ └── tika-server.jar.md5
└── main.py
Code in main.py:
import os
from pathlib import Path
ROOT_DIR = Path(__file__).resolve(strict=True).parent
os.environ["TIKA_SERVER_JAR"] = f"file://{ROOT_DIR / 'lib/tika-server.jar'}"
import tika.parser
# Use tika.parser
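A variant of the recap above, as a sketch: pathlib can build the file: URI itself through Path.as_uri(), which also handles Windows drive letters and percent-encodes characters such as spaces. The helper names are hypothetical, and whether tika accepts the file:///C:/... form that as_uri() produces on Windows is an assumption not verified in this thread.

```python
import os
from pathlib import Path


def local_jar_uri(base_dir: Path, relative_jar: str = "lib/tika-server.jar") -> str:
    """Build a file: URI for the jar under base_dir (hypothetical helper)."""
    # as_uri() requires an absolute path, so resolve it first.
    jar_path = (Path(base_dir) / relative_jar).resolve()
    return jar_path.as_uri()


def point_tika_at_local_jar(base_dir: Path) -> str:
    """Set TIKA_SERVER_JAR to the local jar's URI; call this before importing tika."""
    uri = local_jar_uri(base_dir)
    os.environ["TIKA_SERVER_JAR"] = uri
    return uri
```

In main.py this would be called as point_tika_at_local_jar(Path(__file__).resolve().parent), followed by import tika.parser.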
I debugged tika while parsing a file and found that the first time it runs, it needs to download and install the server from a remote URL, which takes a long time. If I deploy the program to an app server with no network access, how do I deal with that? How can I make it load the jar from a fixed directory in code? Thanks.