chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Fixed issue #377 #381

Closed amensiko closed 1 year ago

amensiko commented 2 years ago

This upgrades tika-python to Tika 2.6.0, as per issue #377

chrismattmann commented 1 year ago

hi @amensiko can you take a look? I am going through and closing some old PRs and I think I created a conflict. I will take a look myself but if you have time please update :)

chrismattmann commented 1 year ago

I reviewed this PR. It has a host of hard-coded things (like the download URL) that break back compat, and the rest is mostly formatting and ancillary changes that I wouldn't commit. There are a few changes that are actually useful, mainly:

I think that's it. The PR shouldn't change anything in tika.py but those things. I'll try and get to this today. @amensiko @tballison

chrismattmann commented 1 year ago

OK I have a much simpler patch, here:

mattmann@lasagna:~/git/tika-python$ git diff
diff --git a/tika/tika.py b/tika/tika.py
index 04f3202..4f91111 100755
--- a/tika/tika.py
+++ b/tika/tika.py
@@ -172,7 +172,7 @@ TikaFilesPath = tempfile.gettempdir()
 TikaServerLogFilePath = log_path
 TikaServerJar = os.getenv(
     'TIKA_SERVER_JAR',
-    "http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/"+TikaVersion+"/tika-server-"+TikaVersion+".jar")
+    "http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/"+TikaVersion+"/tika-server-standard-"+TikaVersion+".jar")
 ServerHost = "localhost"
 Port = "9998"
 ServerEndpoint = os.getenv(
@@ -648,10 +648,10 @@ def startServer(tikaServerJar, java_path = TikaJava, java_args = TikaJavaArgs, s
     # setup command string
     cmd_string = ""
     if not config_path:
-        cmd_string = '%s %s -cp "%s" org.apache.tika.server.TikaServerCli --port %s --host %s &' \
+        cmd_string = '%s %s -cp "%s" org.apache.tika.server.core.TikaServerCli --port %s --host %s &' \
                      % (java_path, java_args, classpath, port, host)
     else:
-        cmd_string = '%s %s -cp "%s" org.apache.tika.server.TikaServerCli --port %s --host %s --config %s &' \
+        cmd_string = '%s %s -cp "%s" org.apache.tika.server.core.TikaServerCli --port %s --host %s --config %s &' \
                      % (java_path, java_args, classpath, port, host, config_path)

     # Check that we can write to log path
@@ -688,7 +688,7 @@ def startServer(tikaServerJar, java_path = TikaJava, java_args = TikaJavaArgs, s
     while try_count < TikaStartupMaxRetry:
         with open(tika_log_file_path, "r") as tika_log_file_tmp:
             # check for INFO string to confirm listening endpoint
-            if "Started Apache Tika server at" in tika_log_file_tmp.read():
+            if "Started Apache Tika server" in tika_log_file_tmp.read():
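
For readers comparing the two server layouts, the three changes in the diff above can be sketched as small version-aware helpers. This is only an illustration: the function names are hypothetical and not part of tika.py, and it assumes that every 2.x release uses the `tika-server-standard` artifact and the `org.apache.tika.server.core` package shown in the patch.

```python
# Hypothetical helpers (not part of tika.py) that mirror the three changes in the patch.

MAVEN_BASE = "http://search.maven.org/remotecontent?filepath=org/apache/tika/"


def _major(version):
    """Return the major version number, e.g. 2 for '2.6.0'."""
    return int(version.split(".")[0])


def tika_server_jar_url(version):
    """Build the default server jar URL for a given Tika version.

    Assumption: 2.x and later use the 'tika-server-standard' artifact,
    earlier releases use 'tika-server', as in the diff.
    """
    artifact = "tika-server-standard" if _major(version) >= 2 else "tika-server"
    return MAVEN_BASE + artifact + "/" + version + "/" + artifact + "-" + version + ".jar"


def tika_server_main_class(version):
    """Return the main class passed to 'java -cp' for a given Tika version."""
    if _major(version) >= 2:
        return "org.apache.tika.server.core.TikaServerCli"
    return "org.apache.tika.server.TikaServerCli"


def server_started(log_text):
    """Check the server log for the startup message.

    The shorter substring from the patch also matches the older
    'Started Apache Tika server at' message.
    """
    return "Started Apache Tika server" in log_text
```

The patch itself does not branch on the version; it simply switches the defaults to the 2.x values, and the `TIKA_SERVER_JAR` environment variable still overrides the download URL.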

That said, two tests are failing (both in tests_unpack). See below:

======================================================================
ERROR: test_ascii (tika.tests.tests_unpack.CreateTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mattmann/git/tika-python/tika/tests/tests_unpack.py", line 26, in test_ascii
    parsed = unpack.from_file(f.name)
  File "/home/mattmann/git/tika-python/tika/unpack.py", line 44, in from_file
    return _parse(tarOutput)
  File "/home/mattmann/git/tika-python/tika/unpack.py", line 79, in _parse
    with _text_wrapper(tarFile.extractfile(metadataMember)) as metadataFile:
AttributeError: __exit__

======================================================================
ERROR: test_utf8 (tika.tests.tests_unpack.CreateTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mattmann/git/tika-python/tika/tests/tests_unpack.py", line 18, in test_utf8
    parsed = unpack.from_file(f.name)
  File "/home/mattmann/git/tika-python/tika/unpack.py", line 44, in from_file
    return _parse(tarOutput)
  File "/home/mattmann/git/tika-python/tika/unpack.py", line 79, in _parse
    with _text_wrapper(tarFile.extractfile(metadataMember)) as metadataFile:
AttributeError: __exit__

----------------------------------------------------------------------
Ran 18 tests in 71.383s

FAILED (errors=2)
Test failed: <unittest.runner.TextTestResult run=18 errors=2 failures=0>
error: Test failed: <unittest.runner.TextTestResult run=18 errors=2 failures=0>

I'll debug these now.

chrismattmann commented 1 year ago

So the unpack errors had nothing to do with this patch; they came from the older version of Python I was testing on (2.7). I have a fix that works on both Python 2.7 and 3.7, which I will commit separately. All tests pass now.
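
The actual fix is not shown in this thread. The traceback above points at a `with` statement applied to an object that has no `__exit__` method on Python 2.7. One portable way to handle that kind of object, sketched here under that assumption, is `contextlib.closing`, which only requires a `close()` method. The helper names below are hypothetical and do not come from unpack.py.

```python
import io
import tarfile
from contextlib import closing


def read_member_text(tar_bytes, member_name, encoding="utf-8"):
    """Read one member of an in-memory tar archive as text.

    closing() calls .close() on exit, so it works even when the
    wrapped object does not implement __enter__/__exit__ itself.
    """
    with closing(tarfile.open(fileobj=io.BytesIO(tar_bytes))) as tar_file:
        with closing(tar_file.extractfile(member_name)) as raw:
            return raw.read().decode(encoding)


def make_tar(name, payload):
    """Build a one-file tar archive in memory (used for the example only)."""
    buf = io.BytesIO()
    tar = tarfile.open(fileobj=buf, mode="w")
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
    tar.close()
    return buf.getvalue()
```

For example, `read_member_text(make_tar("meta.txt", b"hello"), "meta.txt")` reads the member back as text without relying on the file object being a context manager.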