chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

`parser.from_file()` does not work with serverEndpoint different from localhost #273

Closed. Tooa closed this 4 years ago

Tooa commented 4 years ago

Summary

Passing a serverEndpoint other than localhost to `parser.from_file()` raises a `requests.exceptions.ConnectionError`, because the request is still sent to localhost. See the snippet below:

import tika
from tika import parser
tika.TikaClientOnly = True
result = parser.from_file('file', 'http://10.5.0.5:9998')

Steps to reproduce

services:
  tika-server:
    image: logicalspark/docker-tikaserver:latest
    networks:
      net:
        ipv4_address: 10.5.0.5

  consumer:
    build: ./
    networks:
      net:
        ipv4_address: 10.5.0.6
    command: ["document_consumer"]

networks:
  net:
    driver: bridge
    ipam:
      config:

Expected Behaviour

Actual Behaviour

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/tika/parser.py", line 36, in from_file
    output = parse1(service, filename, serverEndpoint, headers=headers, config_path=config_path, requestOptions=requestOptions)
  File "/usr/lib/python3.7/site-packages/tika/tika.py", line 331, in parse1
    rawResponse=rawResponse, requestOptions=requestOptions)
  File "/usr/lib/python3.7/site-packages/tika/tika.py", line 547, in callServer
    resp = verbFn(serviceUrl, encodedData, **effectiveRequestOptions)
  File "/usr/lib/python3.7/site-packages/requests/api.py", line 131, in put
    return request('put', url, data=data, **kwargs)
  File "/usr/lib/python3.7/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=9998): Max retries exceeded with url: /rmeta/text (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4665883d50>: Failed to establish a new connection: [Errno 111] Connection refused'))

Analysis and Details

import tika
from tika import parser
tika.TikaClientOnly = True
result = parser.from_buffer('Good Evening', 'http://10.5.0.5:9998')
print(result)
{'metadata': {'Content-Encoding': 'ISO-8859-1', 'Content-Type': 'text/plain; charset=ISO-8859-1', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '92'}, 'content': '\n\n\n\n\n\n\n\nGood Evening\n', 'status': 200}
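
One reading that fits both snippets, offered as an assumption rather than a confirmed diagnosis: the second positional parameter of `from_file` may be the service name rather than the endpoint, so the URL never reaches the endpoint parameter and the default localhost address is used, whereas `from_buffer` takes the endpoint second. The sketch below uses hypothetical stand-in functions (`from_file_sketch`, `from_buffer_sketch` and `DEFAULT_ENDPOINT` are illustrative names, not tika-python code) to show the effect:

```python
# Hypothetical stand-ins that only mimic an assumed parameter order;
# they are not the tika-python implementation.
DEFAULT_ENDPOINT = "http://localhost:9998"  # assumed default endpoint


def from_file_sketch(filename, service="all", server_endpoint=DEFAULT_ENDPOINT):
    """Assumed order: the service name comes before the endpoint."""
    return {"service": service, "endpoint": server_endpoint}


def from_buffer_sketch(text, server_endpoint=DEFAULT_ENDPOINT):
    """Assumed order: the endpoint is the second parameter."""
    return {"endpoint": server_endpoint}


remote = "http://10.5.0.5:9998"

# The URL lands in `service`, so the endpoint silently stays on localhost.
broken = from_file_sketch("file", remote)

# Same call shape, but here the second argument really is the endpoint.
buffered = from_buffer_sketch("Good Evening", remote)

# Mirrors the workaround: name the service first, then the endpoint.
fixed = from_file_sketch("file", "all", remote)

print(broken["endpoint"])
print(buffered["endpoint"])
print(fixed["endpoint"])
```

If this assumption holds, it would explain why the traceback reports `host='localhost'` even though a remote address was supplied.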

Workaround

Type "help", "copyright", "credits" or "license" for more information.
>>> import tika
>>> from tika import parser
>>> tika.TikaClientOnly = True
>>> result = parser.from_file('file', 'all', 'http://10.5.0.5:9998/')
>>> print(result)
{'metadata': {'Content-Encoding': 'ISO-8859-1', 'Content-Type': 'text/plain; charset=ISO-8859-1', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '94', 'resourceName': "b'file'"}, 'content': '\n\n\n\n\n\n\n\n\nThis is Sparta!\n\n', 'status': 200}
>>>
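
A call that does not depend on parameter order can pass the endpoint by keyword. This is a minimal sketch, not a tested fix: `parse_remote` is a hypothetical helper, the keyword name `serverEndpoint` is taken from the variable passed to `parse1` in the traceback above, and running it assumes tika-python is installed and a Tika server is reachable at the given address.

```python
def parse_remote(path, endpoint):
    """Parse `path` on a remote Tika server, naming the endpoint explicitly.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so the helper can be defined without tika installed.
    import tika
    from tika import parser

    # Client-only mode, as in the snippets in this issue.
    tika.TikaClientOnly = True

    # Passing the endpoint by keyword avoids any positional mix-up.
    return parser.from_file(path, serverEndpoint=endpoint)
```

For example, `parse_remote('file', 'http://10.5.0.5:9998')` would target the server from the compose setup in this issue.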
chrismattmann commented 4 years ago

Thanks @Tooa, fantastic. This looks like a legit issue with the docs and/or the expected behavior. Care to submit a simple PR that fixes `.from_file` to work as expected? Thanks for the fantastic issue and report.

Tooa commented 4 years ago

@chrismattmann Thank you. I have proposed a change, including a test case, as a PR.