chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

fix unpack from_file/from_buffer headers arg #387

Closed: deadc0de6 closed this 1 year ago

deadc0de6 commented 1 year ago

It seems to me that the headers= argument was misplaced: it was declared in from_file but used in from_buffer. Thanks for your work on tika-python!

chrismattmann commented 1 year ago

Thanks a lot, @deadc0de6. Can you provide a unit test that exposes this bug and verifies the fix?

deadc0de6 commented 1 year ago

@chrismattmann I don't have a unit test file but running pylint on it exposes the bug (as well as other errors / problems):

unpack.py:33:55: W0613: Unused argument 'headers' (unused-argument)
unpack.py:57:60: E0602: Undefined variable 'headers' (undefined-variable)

Here's the full pylint log for unpack.py:

$ pylint unpack.py
************* Module tika.unpack
unpack.py:101:0: C0301: Line too long (117/100) (line-too-long)
unpack.py:122:1: W0511: TODO: Remove if/when fixed. https://issues.apache.org/jira/browse/TIKA-3070 (fixme)
unpack.py:1:0: C0114: Missing module docstring (missing-module-docstring)
unpack.py:33:24: C0103: Argument name "serverEndpoint" doesn't conform to snake_case naming style (invalid-name)
unpack.py:33:69: C0103: Argument name "requestOptions" doesn't conform to snake_case naming style (invalid-name)
unpack.py:33:0: W0102: Dangerous default value {} as argument (dangerous-default-value)
unpack.py:40:4: C0103: Variable name "tarOutput" doesn't conform to snake_case naming style (invalid-name)
unpack.py:33:55: W0613: Unused argument 'headers' (unused-argument)
unpack.py:48:24: C0103: Argument name "serverEndpoint" doesn't conform to snake_case naming style (invalid-name)
unpack.py:48:55: C0103: Argument name "requestOptions" doesn't conform to snake_case naming style (invalid-name)
unpack.py:48:0: W0102: Dangerous default value {} as argument (dangerous-default-value)
unpack.py:55:23: E1124: Argument 'headers' passed by position and keyword in function call (redundant-keyword-arg)
unpack.py:57:60: E0602: Undefined variable 'headers' (undefined-variable)
unpack.py:62:11: C0103: Argument name "tarOutput" doesn't conform to snake_case naming style (invalid-name)
unpack.py:62:0: R0914: Too many local variables (16/15) (too-many-locals)
unpack.py:64:4: R1705: Unnecessary "elif" after "return", remove the leading "el" from "elif" (no-else-return)
unpack.py:69:56: C0103: Variable name "tarFile" doesn't conform to snake_case naming style (invalid-name)
unpack.py:71:8: C0103: Variable name "memberNames" doesn't conform to snake_case naming style (invalid-name)
unpack.py:78:8: C0103: Variable name "metadataMember" doesn't conform to snake_case naming style (invalid-name)
unpack.py:80:80: C0103: Variable name "metadataFile" doesn't conform to snake_case naming style (invalid-name)
unpack.py:81:16: C0103: Variable name "metadataReader" doesn't conform to snake_case naming style (invalid-name)
unpack.py:82:20: C0103: Variable name "metadataLine" doesn't conform to snake_case naming style (invalid-name)
unpack.py:98:12: C0103: Variable name "contentMember" doesn't conform to snake_case naming style (invalid-name)
unpack.py:110:12: C0103: Variable name "attachmentMember" doesn't conform to snake_case naming style (invalid-name)
unpack.py:62:0: R0912: Too many branches (13/12) (too-many-branches)
unpack.py:123:20: C0103: Argument name "s" doesn't conform to snake_case naming style (invalid-name)
unpack.py:20:0: C0411: standard import "import tarfile" should be placed before "from .tika import parse1, callServer, ServerEndpoint" (wrong-import-order)
unpack.py:21:0: C0411: standard import "from io import BytesIO, TextIOWrapper" should be placed before "from .tika import parse1, callServer, ServerEndpoint" (wrong-import-order)
unpack.py:22:0: C0411: standard import "import csv" should be placed before "from .tika import parse1, callServer, ServerEndpoint" (wrong-import-order)
unpack.py:23:0: C0411: standard import "from sys import version_info" should be placed before "from .tika import parse1, callServer, ServerEndpoint" (wrong-import-order)
unpack.py:24:0: C0411: standard import "from contextlib import closing" should be placed before "from .tika import parse1, callServer, ServerEndpoint" (wrong-import-order)

-----------------------------------
Your code has been rated at 3.04/10

deadc0de6 commented 1 year ago

Also, from_buffer fails at runtime whether or not headers is passed:

>>> from tika import unpack
>>> unpack.from_buffer('what?')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/tika-python/tika/unpack.py", line 57, in from_buffer
    rawResponse=True, headers=headers, requestOptions=requestOptions)
NameError: name 'headers' is not defined
>>> unpack.from_buffer('what?', headers='whatever')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: from_buffer() got an unexpected keyword argument 'headers'
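
The two failures above can be reproduced without a Tika server. The following is a minimal sketch using hypothetical stand-in functions (`call_server`, `from_file_buggy`, `from_buffer_buggy` and `from_buffer_fixed` are illustrative names, not the actual code in tika/unpack.py). It only mirrors the pattern pylint reports: `headers` is declared where it is unused and used where it is not declared.

```python
# Hypothetical stand-ins for illustration; this is NOT the real tika/unpack.py.

def call_server(payload, headers=None):
    """Pretend to send a request; return what would have been sent."""
    return {"payload": payload, "headers": headers or {}}


def from_file_buggy(path, headers=None):
    # 'headers' is accepted but never used (the W0613 pattern).
    return call_server(path)


def from_buffer_buggy(data):
    # 'headers' is not a parameter here, so the name is undefined
    # (the E0602 pattern); calling this raises NameError.
    return call_server(data, headers=headers)


def from_buffer_fixed(data, headers=None):
    # Assumed shape of a fix: declare the parameter where it is used.
    return call_server(data, headers=headers)


if __name__ == "__main__":
    # Both calls fail, for different reasons, as in the session above.
    for kwargs in ({}, {"headers": {"param": "whatever"}}):
        try:
            from_buffer_buggy("what?", **kwargs)
        except (NameError, TypeError) as exc:
            print(type(exc).__name__, exc)
    print(from_buffer_fixed("what?", headers={"param": "whatever"}))
```

Calling the buggy function without `headers` fails inside the body (the name is undefined), while calling it with `headers=` fails before the body runs (the signature rejects the keyword), which matches the two different tracebacks.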

chrismattmann commented 1 year ago

So I applied this PR in a local branch and got the following:

mattmann@lasagna:~/git/tika-python$ /home/mattmann/install/python2/bin/python 
Python 2.7.18rc1 (default, Apr  7 2020, 12:05:55) 
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from tika import unpack
>>> unpack.from_buffer('What?')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "tika/unpack.py", line 57, in from_buffer
    rawResponse=True, headers=headers, requestOptions=requestOptions)
TypeError: callServer() got multiple values for keyword argument 'headers'
>>> unpack.from_buffer('What?', headers='whatever')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "tika/unpack.py", line 57, in from_buffer
    rawResponse=True, headers=headers, requestOptions=requestOptions)
TypeError: callServer() got multiple values for keyword argument 'headers'
>>> 

Any ideas, @deadc0de6?

chrismattmann commented 1 year ago

OK, I fixed it. I just needed to ensure that headers weren't being passed twice to callServer. I will also add and commit a unit test shortly. One sec.

mattmann@lasagna:~/git/tika-python$ /home/mattmann/install/python2/bin/python 
Python 2.7.18rc1 (default, Apr  7 2020, 12:05:55) 
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from tika import unpack
>>> unpack.from_buffer('what?')
2023-01-16 16:54:59,127 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
2023-01-16 16:55:06,174 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar.md5 to /tmp/tika-server.jar.md5.
2023-01-16 16:55:06,695 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
{'content': u'what?\n', 'attachments': {}, 'metadata': {'Content-Length': '5', 'X-TIKA:Parsed-By-Full-Set': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'Content-Type': 'text/plain; charset=ISO-8859-1', 'X-TIKA:Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'Content-Encoding': 'ISO-8859-1'}}
>>> unpack.from_buffer('what?',headers={'param': 'whatever'})
{'content': u'what?\n', 'attachments': {}, 'metadata': {'Content-Length': '5', 'X-TIKA:Parsed-By-Full-Set': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'Content-Type': 'text/plain; charset=ISO-8859-1', 'X-TIKA:Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'Content-Encoding': 'ISO-8859-1'}}
>>> 
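
The earlier `got multiple values for keyword argument 'headers'` error, like pylint's E1124 on the same call, arises when one argument fills the `headers` slot positionally and `headers=` is then passed again by keyword. A minimal sketch follows, with hypothetical stand-ins (`call_server`, `from_buffer_duplicate`, `from_buffer_merged`) and an assumed endpoint path and `Accept` value. Merging the caller's headers into a single dict is one plausible way to avoid the duplicate; it is not necessarily the exact change that was committed.

```python
# Hypothetical stand-ins for illustration; names, path and header value
# are assumptions, not the real tika-python code.

def call_server(verb, path, data, headers, verbose=False, raw_response=False):
    """Pretend to send a request; return what would have been sent."""
    return {"verb": verb, "path": path, "data": data, "headers": dict(headers)}


def from_buffer_duplicate(data, headers=None):
    # The fixed header dict already fills the positional 'headers' slot,
    # so passing headers= again by keyword raises TypeError.
    return call_server("put", "/unpack/all", data,
                       {"Accept": "application/x-tar"}, False,
                       raw_response=True, headers=headers)


def from_buffer_merged(data, headers=None):
    # Copy the caller's headers (if any), then add the fixed one, and
    # pass the result exactly once.
    merged = dict(headers or {})
    merged["Accept"] = "application/x-tar"
    return call_server("put", "/unpack/all", data, merged, False,
                       raw_response=True)


if __name__ == "__main__":
    try:
        from_buffer_duplicate("what?", headers={"param": "whatever"})
    except TypeError as exc:
        print("TypeError:", exc)
    print(from_buffer_merged("what?", headers={"param": "whatever"}))
```

Copying into a new dict also avoids mutating the caller's dictionary, which matters here because the pylint log flags a shared `{}` default argument (W0102) in the same functions.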

chrismattmann commented 1 year ago

Alright, merged with unit tests passing! Thank you, @deadc0de6!