chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Tika is not working while accessing via proxy #158

Closed Balachandar-R closed 6 years ago

Balachandar-R commented 7 years ago

Hi Team,

I have installed python-tika-1.14 on a Linux (Ubuntu 16.04) box running in the cloud. I am executing this code:

import tika
from tika import parser
parsed = parser.from_file('/file/path/file.txt')

I didn't get any error when I had open access to the Linux instance.

When I ran the same code behind a proxy, it failed as follows.

parsed = parser.from_file('/home/path/file.txt')
2017-09-20 04:25:35,531 [MainThread ] [WARNI] Tika server returned status: 403
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tika/parser.py", line 28, in from_file
    return _parse(jsonOutput)
  File "/usr/local/lib/python2.7/dist-packages/tika/parser.py", line 47, in _parse
    realJson = json.loads(jsonOutput[1])
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

Please help me with how to set the proxy in python-tika and where to configure it.

Thanks Balachandar

chrismattmann commented 7 years ago

hi @Balachandar-R thanks for your report. What is the proxy URL you are trying to access? I see that you are getting a 403 - does your proxy require credentials?

Balachandar-R commented 7 years ago

Hi @chrismattmann ,

Thanks for your quick response,

Proxy URL is:

export http_proxy="http://172.27.66.50:9400"
export https_proxy="http://172.27.66.50:9400"

Thanks, Balachandar

chrismattmann commented 7 years ago

Does it require credentials?

Balachandar-R commented 7 years ago

No. The Tika server started and port 9998 is open; we can see it in the LISTEN state (127.0.0.1:9998) via the command netstat -na | grep 9998.

The request reaches the proxy server, but we never get a response back.

chrismattmann commented 7 years ago

can you add a URL parameter, as in parser.from_file('/path/to/file', 'http://172.27.66.50:9400'), and try that? (all methods at the interface level take an optional parameter for a different Tika server to contact). @Balachandar-R

Balachandar-R commented 6 years ago

@chrismattmann

I'm getting the following error when I try with the URL parameter.

p = parser.from_file('/home/yell/sentence_success.txt','http://172.27.66.50:9400')
2017-09-25 03:47:55,617 [MainThread ] [WARNI] Tika server returned status: 504
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tika/parser.py", line 37, in from_file
    return _parse(jsonOutput)
  File "/usr/local/lib/python2.7/dist-packages/tika/parser.py", line 69, in _parse
    realJson = json.loads(jsonOutput[1])
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

But without the proxy it works fine.

Do you have any other suggestions?

Thanks Balachandar-R

chrismattmann commented 6 years ago

mmm looks like you are getting an HTTP 504 error, which corresponds to:

10.5.5 504 Gateway Timeout. The server, while acting as a gateway or proxy, did not receive a timely response from the upstream server specified by the URI (e.g. HTTP, FTP, LDAP) or some other auxiliary server (e.g. DNS) it needed to access in attempting to complete the request.

chrismattmann commented 6 years ago

(proxy configuration issue?)
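The ValueError in both tracebacks above looks like a side effect rather than the real failure: the body that comes back with the 403 or 504 is presumably an error page, not JSON, so json.loads cannot decode it. Below is a minimal sketch of a check that surfaces the HTTP status instead. It assumes the (status, body) pair suggested by jsonOutput[1] in the traceback; parse_tika_output is a hypothetical helper for illustration, not part of tika-python.

```python
import json


def parse_tika_output(output):
    """Decode a (status, body) pair, failing early on non-200 statuses.

    Hypothetical helper: the tuple shape is inferred from the traceback
    in this thread, not taken from tika-python's documented API.
    """
    status, body = output
    if status != 200:
        raise RuntimeError(
            "Tika server (or a proxy in front of it) returned HTTP %s" % status
        )
    return json.loads(body)


# Example: a proxy error page now produces a readable message
# instead of "No JSON object could be decoded".
try:
    parse_tika_output((504, "<html>Gateway Timeout</html>"))
except RuntimeError as err:
    print(err)
```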

Balachandar-R commented 6 years ago

Hi @chrismattmann ,

This issue was resolved by setting no_proxy="proxy-address" at the code level. Thanks for your quick replies, @chrismattmann.

One more question about the Tika server: is there any restriction on the total number of files for extraction? For some specific Excel files, Tika failed to extract the content and returned a 403 status.
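The thread does not say what value was used for "proxy-address". One plausible reading, given that the Tika server was listening on 127.0.0.1:9998, is that requests to the local server were being routed through the proxy, and that listing the local addresses in no_proxy lets them bypass it. A minimal sketch under that assumption follows; the host list and the helper name bypass_proxy_for_local_tika are illustrative, not from the original report.

```python
import os

# Assumption: the Tika server runs locally on 127.0.0.1:9998, as reported
# earlier in this thread. Adjust the host list for your own setup.
LOCAL_HOSTS = "localhost,127.0.0.1"


def bypass_proxy_for_local_tika(hosts=LOCAL_HOSTS):
    """Append hosts to no_proxy/NO_PROXY so local requests skip the proxy."""
    for var in ("no_proxy", "NO_PROXY"):
        existing = [h for h in os.environ.get(var, "").split(",") if h]
        for host in hosts.split(","):
            if host not in existing:
                existing.append(host)
        os.environ[var] = ",".join(existing)
    return os.environ["no_proxy"]


bypass_proxy_for_local_tika()

# Then use tika as usual, for example:
# from tika import parser
# parsed = parser.from_file('/file/path/file.txt')
```

The same effect can usually be had from the shell with an export of no_proxy next to the http_proxy and https_proxy exports shown earlier.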

Any comments on the above?

Thanks Balachandar-R

KramFox commented 6 years ago

Hi @Balachandar-R,

I have the same issue when I try to use tika-python via a proxy to parse a file. I read your explanation, but could you give more detail about the solution you applied?

Thanks, KramFox

omidbadr commented 4 years ago

I found the tika package on my local computer after installing it via pip, and I manually changed the following line (all 4 occurrences) in tika.py:

urlretrieve(urlOrPath, destPath)

to:

import urllib.request

# create the object, assign it to a variable
proxy = urllib.request.ProxyHandler({'http': '...', 'https': '...'})

# construct a new opener using your proxy settings
opener = urllib.request.build_opener(proxy)

# install the opener at the module level
urllib.request.install_opener(opener)
urllib.request.urlretrieve(urlOrPath, destPath)

and it worked !!
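Because install_opener changes the opener used by urllib.request for the whole process, the same effect may be achievable from your own script without editing tika.py in site-packages. This is a sketch, assuming Python 3 and assuming tika.py downloads through urllib's urlretrieve as the lines above indicate. The helper name install_proxy_opener and the proxy address are hypothetical.

```python
import urllib.request


def install_proxy_opener(http_proxy, https_proxy=None):
    """Install a process-wide urllib opener that routes through a proxy."""
    proxies = {"http": http_proxy, "https": https_proxy or http_proxy}
    handler = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(handler)
    # From here on, urlopen() and urlretrieve() use this opener.
    urllib.request.install_opener(opener)
    return opener


# Hypothetical proxy address; replace it with your own.
opener = install_proxy_opener("http://proxy.example.com:8080")

# Call this before the first tika import or parse call, so that any
# download made through urllib goes through the proxy.
```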

chrismattmann commented 4 years ago

hi @omidbadr if you get a chance, consider sending me a PR that makes the above an optional configuration?

hubgitadi commented 3 years ago

@omidbadr Thank you mate, your trick worked like a charm.

@chrismattmann - I was getting an "HTTPError: HTTP Error 407: Proxy Authentication Required" error, but @omidbadr's solution came to the rescue. I'm still trying to understand the root cause, though; can you help?
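For context, HTTP 407 means the proxy itself is asking for credentials. urllib's ProxyHandler accepts them embedded in the proxy URL, which may be why the approach above helps when the '...' values include a user name and password. A minimal sketch of building such a URL follows. The host, user and password are made up, and proxy_url_with_credentials is a hypothetical helper.

```python
from urllib.parse import quote


def proxy_url_with_credentials(host, port, user, password, scheme="http"):
    """Build a proxy URL with percent-encoded credentials embedded."""
    return "{}://{}:{}@{}:{}".format(
        scheme,
        quote(user, safe=""),      # encode characters such as '@' or ':'
        quote(password, safe=""),
        host,
        port,
    )


# Hypothetical values, for illustration only.
url = proxy_url_with_credentials("proxy.example.com", 8080, "alice", "p@ss:word")
print(url)

# The result can then be used as the value for both keys:
# urllib.request.ProxyHandler({'http': url, 'https': url})
```

Percent-encoding matters here because an unescaped '@' or ':' in the password would change how the URL is split into user, password and host.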

chrismattmann commented 3 years ago

not sure about the root cause, likely buried in the requests lib

Also can someone send me a PR @hubgitadi @omidbadr so that we can make the above an optional config with docs?

mansvi13 commented 3 years ago

hey @chrismattmann

I'm getting the warning "Tika server returned status: 403" and a JSONDecodeError while using it in the Unix terminal. How can I solve this issue?

chrismattmann commented 3 years ago

make sure that your tika server started, and that you have Java installed.
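Both conditions can be checked quickly from Python. The sketch below uses only the standard library. The default host and port (127.0.0.1:9998) are the ones reported earlier in this thread and may differ in your setup; the function names are illustrative.

```python
import shutil
import socket


def java_available():
    """Return True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


def port_is_open(host="127.0.0.1", port=9998, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("java on PATH:", java_available())
print("Tika port reachable:", port_is_open())
```

A reachable port only shows that something is listening there. If a proxy answers instead of Tika, the 403 can still occur.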

ghost commented 3 years ago

I was able to resolve the 403 error by configuring the User-Agent in the Python requests module. Can you tell me how I can pass a user agent in tika?

ashish735 commented 3 years ago

@chrismattmann shall I make the PR for this, that is, to make Tika work properly when accessed via a proxy?

chrismattmann commented 3 years ago

sure I'll take a look @ashish735