Closed by deadc0de6 1 year ago
thanks a lot @deadc0de6 can you provide a unit test that exposes this and the fix?
@chrismattmann I don't have a unit test file but running pylint on it exposes the bug (as well as other errors / problems):
unpack.py:33:55: W0613: Unused argument 'headers' (unused-argument)
unpack.py:57:60: E0602: Undefined variable 'headers' (undefined-variable)
Here's the full pylint log for the unpack.py file:
$ pylint unpack.py
************* Module tika.unpack
unpack.py:101:0: C0301: Line too long (117/100) (line-too-long)
unpack.py:122:1: W0511: TODO: Remove if/when fixed. https://issues.apache.org/jira/browse/TIKA-3070 (fixme)
unpack.py:1:0: C0114: Missing module docstring (missing-module-docstring)
unpack.py:33:24: C0103: Argument name "serverEndpoint" doesn't conform to snake_case naming style (invalid-name)
unpack.py:33:69: C0103: Argument name "requestOptions" doesn't conform to snake_case naming style (invalid-name)
unpack.py:33:0: W0102: Dangerous default value {} as argument (dangerous-default-value)
unpack.py:40:4: C0103: Variable name "tarOutput" doesn't conform to snake_case naming style (invalid-name)
unpack.py:33:55: W0613: Unused argument 'headers' (unused-argument)
unpack.py:48:24: C0103: Argument name "serverEndpoint" doesn't conform to snake_case naming style (invalid-name)
unpack.py:48:55: C0103: Argument name "requestOptions" doesn't conform to snake_case naming style (invalid-name)
unpack.py:48:0: W0102: Dangerous default value {} as argument (dangerous-default-value)
unpack.py:55:23: E1124: Argument 'headers' passed by position and keyword in function call (redundant-keyword-arg)
unpack.py:57:60: E0602: Undefined variable 'headers' (undefined-variable)
unpack.py:62:11: C0103: Argument name "tarOutput" doesn't conform to snake_case naming style (invalid-name)
unpack.py:62:0: R0914: Too many local variables (16/15) (too-many-locals)
unpack.py:64:4: R1705: Unnecessary "elif" after "return", remove the leading "el" from "elif" (no-else-return)
unpack.py:69:56: C0103: Variable name "tarFile" doesn't conform to snake_case naming style (invalid-name)
unpack.py:71:8: C0103: Variable name "memberNames" doesn't conform to snake_case naming style (invalid-name)
unpack.py:78:8: C0103: Variable name "metadataMember" doesn't conform to snake_case naming style (invalid-name)
unpack.py:80:80: C0103: Variable name "metadataFile" doesn't conform to snake_case naming style (invalid-name)
unpack.py:81:16: C0103: Variable name "metadataReader" doesn't conform to snake_case naming style (invalid-name)
unpack.py:82:20: C0103: Variable name "metadataLine" doesn't conform to snake_case naming style (invalid-name)
unpack.py:98:12: C0103: Variable name "contentMember" doesn't conform to snake_case naming style (invalid-name)
unpack.py:110:12: C0103: Variable name "attachmentMember" doesn't conform to snake_case naming style (invalid-name)
unpack.py:62:0: R0912: Too many branches (13/12) (too-many-branches)
unpack.py:123:20: C0103: Argument name "s" doesn't conform to snake_case naming style (invalid-name)
unpack.py:20:0: C0411: standard import "import tarfile" should be placed before "from .tika import parse1, callServer, ServerEndpoint" (wrong-import-order)
unpack.py:21:0: C0411: standard import "from io import BytesIO, TextIOWrapper" should be placed before "from .tika import parse1, callServer, ServerEndpoint" (wrong-import-order)
unpack.py:22:0: C0411: standard import "import csv" should be placed before "from .tika import parse1, callServer, ServerEndpoint" (wrong-import-order)
unpack.py:23:0: C0411: standard import "from sys import version_info" should be placed before "from .tika import parse1, callServer, ServerEndpoint" (wrong-import-order)
unpack.py:24:0: C0411: standard import "from contextlib import closing" should be placed before "from .tika import parse1, callServer, ServerEndpoint" (wrong-import-order)
-----------------------------------
Your code has been rated at 3.04/10
Also, calling from_buffer fails with or without headers:
>>> from tika import unpack
>>> unpack.from_buffer('what?')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tmp/tika-python/tika/unpack.py", line 57, in from_buffer
rawResponse=True, headers=headers, requestOptions=requestOptions)
NameError: name 'headers' is not defined
>>> unpack.from_buffer('what?', headers='whatever')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: from_buffer() got an unexpected keyword argument 'headers'
So I applied this PR in a local branch and got the following:
mattmann@lasagna:~/git/tika-python$ /home/mattmann/install/python2/bin/python
Python 2.7.18rc1 (default, Apr 7 2020, 12:05:55)
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from tika import unpack
>>> unpack.from_buffer('What?')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "tika/unpack.py", line 57, in from_buffer
rawResponse=True, headers=headers, requestOptions=requestOptions)
TypeError: callServer() got multiple values for keyword argument 'headers'
>>> unpack.from_buffer('What?', headers='whatever')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "tika/unpack.py", line 57, in from_buffer
rawResponse=True, headers=headers, requestOptions=requestOptions)
TypeError: callServer() got multiple values for keyword argument 'headers'
>>>
Any ideas @deadc0de6 ?
OK, I fixed it. I just needed to ensure the headers argument wasn't being passed twice. I'll also add and commit a unit test shortly. One sec.
mattmann@lasagna:~/git/tika-python$ /home/mattmann/install/python2/bin/python
Python 2.7.18rc1 (default, Apr 7 2020, 12:05:55)
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from tika import unpack
>>> unpack.from_buffer('what?')
2023-01-16 16:54:59,127 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
2023-01-16 16:55:06,174 [MainThread ] [INFO ] Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar.md5 to /tmp/tika-server.jar.md5.
2023-01-16 16:55:06,695 [MainThread ] [WARNI] Failed to see startup log message; retrying...
{'content': u'what?\n', 'attachments': {}, 'metadata': {'Content-Length': '5', 'X-TIKA:Parsed-By-Full-Set': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'Content-Type': 'text/plain; charset=ISO-8859-1', 'X-TIKA:Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'Content-Encoding': 'ISO-8859-1'}}
>>> unpack.from_buffer('what?',headers={'param': 'whatever'})
{'content': u'what?\n', 'attachments': {}, 'metadata': {'Content-Length': '5', 'X-TIKA:Parsed-By-Full-Set': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'Content-Type': 'text/plain; charset=ISO-8859-1', 'X-TIKA:Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'Content-Encoding': 'ISO-8859-1'}}
>>>
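For anyone following along, the "multiple values for keyword argument 'headers'" error can be reproduced without a Tika server. This is only a sketch: `call_server` below is a hypothetical stand-in, not tika's real `callServer` (its signature isn't shown in this thread), and the `Accept` value is illustrative. It assumes, as pylint's E1124 at line 55 suggests, that the failing call passed a headers dict positionally and then passed `headers=` again by keyword.

```python
def call_server(verb, url, data, headers, raw_response=False):
    """Hypothetical stand-in for a server call; it only echoes the headers it got."""
    return dict(headers)


def from_buffer_broken(text, headers=None):
    headers = headers or {}
    # Bug: the fourth positional argument already binds to `headers`,
    # so repeating it as a keyword makes Python raise TypeError.
    return call_server("put", "/unpack", text, {"Accept": "application/x-tar"},
                       raw_response=True, headers=headers)


def from_buffer_fixed(text, headers=None):
    # Merge the caller's headers with the default into one dict and pass it once.
    merged = dict(headers or {})
    merged.setdefault("Accept", "application/x-tar")
    return call_server("put", "/unpack", text, merged, raw_response=True)
```

With these stand-ins, `from_buffer_broken('what?')` raises `TypeError` whether or not the caller supplies headers, which matches the two tracebacks from the local branch, while `from_buffer_fixed` forwards a single merged dict.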
Alright, merged with unit tests passing! Thank you @deadc0de6 !
It seems to me that the headers= argument was misplaced and was expected in from_buffer instead of from_file. Thanks for your work on python-tika!
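To illustrate the misplaced argument, here is a minimal sketch of the pattern pylint flagged (W0613 at line 33, E0602 at line 57). The function names and the `_send` helper are hypothetical stand-ins written for this comment, not the actual code in unpack.py.

```python
def _send(payload, headers):
    """Hypothetical helper standing in for the real server call."""
    return {"payload": payload, "headers": headers}


def from_file_before(filename, headers=None):
    # `headers` is accepted but never forwarded: pylint W0613 (unused-argument).
    return _send(filename, {})


def from_buffer_before(text):
    # `headers` is used but is not a parameter here: pylint E0602,
    # and a NameError as soon as the function is called.
    return _send(text, headers)  # noqa: F821


def from_buffer_after(text, headers=None):
    # Fix: declare the parameter where it is used and forward it.
    return _send(text, headers or {})
```

Calling `from_buffer_before('what?')` raises `NameError`, and passing `headers=` to it raises `TypeError` for the unexpected keyword, the same pair of failures shown in the first traceback above.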