chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Compatibility with Apache Tika version 2.1.0 #359

Closed bikashg closed 1 year ago

bikashg commented 2 years ago

Hi @chrismattmann ,

Fantastic library! I was wondering whether you have near-term plans or a roadmap for making it compatible with Apache Tika version 2.1.0.

I ran the tika-server-standard-2.1.0.jar file from https://tika.apache.org/download.html locally on my machine, but I get the following error:


```
>>> os.environ["TIKA_SERVER_JAR"] = "file:////home/bikashg/work/tealbear/tika-server-standard-2.1.0.jar"
>>> import tika
>>> tika.initVM()
>>> from tika import parser
>>> parsed1 = parser.from_file('notes.txt')
2021-11-16 16:46:31,249 [MainThread  ] [INFO ]  Retrieving file:////home/bikashg/work/tealbear/tika-server-standard-2.1.0.jar to /tmp/tika-server.jar.
2021-11-16 16:46:31,309 [MainThread  ] [INFO ]  Retrieving file:////home/bikashg/work/tealbear/tika-server-standard-2.1.0.jar.md5 to /tmp/tika-server.jar.md5.
2021-11-16 16:46:31,410 [MainThread  ] [INFO ]  Retrieving file:////home/bikashg/work/tealbear/tika-server-standard-2.1.0.jar to /tmp/tika-server.jar.
2021-11-16 16:46:31,456 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2021-11-16 16:46:36,462 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2021-11-16 16:46:41,467 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
2021-11-16 16:46:46,472 [MainThread  ] [ERROR]  Tika startup log message not received after 3 tries.
2021-11-16 16:46:46,473 [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bikashg/miniconda3/envs/temp_bikash2/lib/python3.9/site-packages/tika/parser.py", line 40, in from_file
    output = parse1(service, filename, serverEndpoint, headers=headers, config_path=config_path, requestOptions=requestOptions)
  File "/home/bikashg/miniconda3/envs/temp_bikash2/lib/python3.9/site-packages/tika/tika.py", line 336, in parse1
    status, response = callServer('put', serverEndpoint, service, f,
  File "/home/bikashg/miniconda3/envs/temp_bikash2/lib/python3.9/site-packages/tika/tika.py", line 531, in callServer
    serverEndpoint = checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath, config_path)
  File "/home/bikashg/miniconda3/envs/temp_bikash2/lib/python3.9/site-packages/tika/tika.py", line 601, in checkTikaServer
    raise RuntimeError("Unable to start Tika server.")
RuntimeError: Unable to start Tika server.
>>> exit()
```

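For anyone hitting the same `Unable to start Tika server.` error, one way to narrow it down is to start the 2.x jar by hand and check whether the server itself answers, independently of tika-python's startup detection. The sketch below is only an illustration and uses the standard library alone. It assumes the server was started manually and listens on Tika's default port 9998, and that it exposes the `/version` endpoint. The helper name `tika_server_version` is made up for this example and is not part of tika-python.

```python
import urllib.error
import urllib.request
from typing import Optional


def tika_server_version(endpoint: str = "http://localhost:9998",
                        timeout: float = 5.0) -> Optional[str]:
    """Return the version string reported by a Tika server, or None if unreachable.

    Hypothetical helper: `endpoint` is assumed to be the base URL of a
    manually started tika-server instance.
    """
    url = endpoint.rstrip("/") + "/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error status: treat as "not up".
        return None


if __name__ == "__main__":
    version = tika_server_version()
    if version is None:
        print("No Tika server answered on http://localhost:9998")
    else:
        print("Tika server is up:", version)
```

If the server does answer, the tika-python README describes the `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` environment variables for talking to an already running server instead of launching the jar. Whether that works against a 2.x server is not established in this thread.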
james-bcs commented 2 years ago

This should probably wait until the Tika team releases a version that fixes CVE-2021-44228. Tika 2.1.0 is the first Tika release to use Log4j2, and the version it bundles most likely falls within the range of Log4j2 versions affected by CVE-2021-44228.

kalebmckale commented 2 years ago

Apache Tika 2.2.1 ships Log4j2 2.17.0, which addresses all of the known issues except the most recent one, CVE-2021-44832, which requires an attacker to have access to the actual configuration file. Tika appears to be in no hurry to release a new version with Log4j2 2.17.1. So the question is: will tika-python wait until that happens, even with the vulnerabilities that Apache Tika 1.24.1 has?

nickchomey commented 2 years ago

Tika 2.3.0 moved to Log4j 2.17.1, so that seems to resolve the remaining issue here.

Moreover, we're now at Tika 2.4.1 AND 1.x will stop receiving updates in 3 weeks (Sept 30, 2022). So we really need to make this project compatible with the latest versions.

@chrismattmann thanks for all you've done, but could you please give us some guidance as to whether this project is completely abandoned now? Should those who are using it make other plans - be it forking it or something else?
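As a rough illustration of the version reasoning in this thread, here is a minimal sketch that reads the Tika version out of a server jar name and compares it with 2.3.0, the release mentioned above as shipping Log4j 2.17.1. The function names and the filename pattern are assumptions made for this example. Nothing here is part of tika-python, and the check says nothing about whether tika-python can actually start that jar.

```python
import re
from typing import Optional, Tuple

# Threshold taken from this thread: Tika 2.3.0 is reported to ship Log4j 2.17.1.
MIN_TIKA_VERSION = (2, 3, 0)

# Assumed naming scheme: tika-server-<x.y.z>.jar or tika-server-standard-<x.y.z>.jar
_JAR_VERSION = re.compile(r"tika-server(?:-standard)?-(\d+)\.(\d+)\.(\d+)\.jar$")


def jar_version(jar_path: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from a jar path, or None if it has no version."""
    match = _JAR_VERSION.search(jar_path)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def meets_minimum(jar_path: str,
                  minimum: Tuple[int, ...] = MIN_TIKA_VERSION) -> bool:
    """True if the jar name carries a version at or above `minimum`."""
    version = jar_version(jar_path)
    return version is not None and version >= minimum


if __name__ == "__main__":
    for name in ("tika-server-standard-2.1.0.jar", "tika-server-standard-2.4.1.jar"):
        print(name, "->", meets_minimum(name))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "2.10.0" as older than "2.3.0".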

kalebmckale commented 1 year ago

@nickchomey @bikashg Our requests have been heard and it's now an active WIP (see #377).

chrismattmann commented 1 year ago

Thanks, and sorry for the delays on updates. I will spend some time over the winter holidays getting this merged.

chrismattmann commented 1 year ago

OK, not in this release (which is going to be 1.24.2), but I have 2 PRs I will look at for the 2.6.x release, which I will make next week. Thanks. This 1.24.2 release will include all the updates from the past 2 years that haven't been released.