chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0
1.51k stars 234 forks

How to handle cases where iterating over 100k files at once fails after parsing a large number? #345

Closed user06039 closed 1 year ago

user06039 commented 3 years ago

I'm using the Apache Tika Python client to parse PDF files, and in my case I have more than a million documents. I think Tika has some limitation: after parsing about 100k files, it starts failing to parse new PDFs when I call:

from tika import parser
parsed = parser.from_file('/path/to/file')

Is this a common issue? How can I handle it? Is it possible to restart tika directly from my python code and make it work? Please help me

chrismattmann commented 1 year ago

This shouldn't be an issue? It may have to do with the Tika server you are using and memory limitations there. Please check. Thanks.

EvgeniyPaskin commented 1 year ago

I have the same issue. It seems 100k files is the default number the Tika server processes before restarting to prevent memory leaks. See https://tika.apache.org/2.2.1/api/org/apache/tika/server/core/ServerStatus.STATUS.html#HIT_MAX_FILES and https://cwiki.apache.org/confluence/display/TIKA/TikaServer+in+Tika+2.x (search for )

Unfortunately, I'm struggling to modify this config option with tika-python. Any help would be appreciated

chrismattmann commented 1 year ago

thanks @EvgeniyPaskin I think that you will probably have to keep track in an upstream Python program, and then just kill the backend server, wait a few seconds, then call Tika Python again which should restart the server. That should do it, cc @tballison
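
The bookkeeping suggested above could be sketched roughly as follows. This is only an illustration, not tika-python's own mechanism: `parse_in_batches`, `parse_fn`, `restart_fn`, `batch_size`, `pause_seconds`, and `sleep_fn` are hypothetical names, and the batch size of 50,000 is an arbitrary value chosen to stay under the 100k limit discussed in this thread.

```python
import time


def parse_in_batches(paths, parse_fn, restart_fn,
                     batch_size=50_000, pause_seconds=5.0,
                     sleep_fn=time.sleep):
    """Yield (path, result) pairs, restarting the backend every batch_size files.

    parse_fn and restart_fn are supplied by the caller (hypothetical hooks),
    so this helper does not depend on any particular Tika client API.
    """
    for count, path in enumerate(paths, start=1):
        yield path, parse_fn(path)
        if count % batch_size == 0:
            # Stop the backend, then give it a few seconds before the next
            # call, which is expected to bring the server back up.
            restart_fn()
            sleep_fn(pause_seconds)
```

With tika-python, `parse_fn` would presumably be `parser.from_file`, and `restart_fn` whatever stops the server in your setup (this thread mentions `tika.killServer()`). Injecting both keeps the counting logic independent of how the server is actually stopped.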

EvgeniyPaskin commented 1 year ago

@chrismattmann thanks for the prompt and useful response. I tried that, but in my case, for some reason, tika.killServer() didn't work within my loop. I ended up starting a Tika server separately from the CLI/terminal with an adjusted tika_config.xml file: java -jar tika-server-standard-2.6.0.jar -config my_tika_config.xml and then letting your fabulous and awesome library do all the parsing logic. Thank you and your (+Tika) team for the great work!
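
For readers trying the same setup, pointing tika-python at a server that is already running might look like the sketch below. It assumes the `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` environment variables described in the tika-python README and a server listening on the default port 9998. `configure_external_tika` and `parse_with_external_server` are hypothetical helper names, not part of the library.

```python
import os


def configure_external_tika(endpoint="http://localhost:9998"):
    """Tell tika-python to act as a REST client only (assumed README behavior).

    Must run before `tika` is imported, since the library reads these
    environment variables at import time.
    """
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint


def parse_with_external_server(path, endpoint="http://localhost:9998"):
    """Parse one file against the externally started server."""
    configure_external_tika(endpoint)
    # Imported lazily so the environment variables above are already set.
    from tika import parser
    return parser.from_file(path, serverEndpoint=endpoint)
```

The server itself is started separately, as in the `java -jar` command above, so its configuration file stays under your control rather than the client's.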

chrismattmann commented 1 year ago

thank you so much @EvgeniyPaskin !

tballison commented 1 year ago

> thanks @EvgeniyPaskin I think that you will probably have to keep track in an upstream Python program, and then just kill the backend server, wait a few seconds, then call Tika Python again which should restart the server. That should do it, cc @tballison

With Tika 2.x, tika-server will stop itself and restart on a crash, on a timeout, or when it hits 100k docs, to avoid memory leaks (as you found). The tika-python client should be flexible enough to wait for tika-server to come back up after it goes down. If you're processing enough files, tika-server will go down and have to restart itself. Bumping the 100k to something larger only delays your run-in with parsers gone bad.

If the tika-server is going down and not restarting (30 secs to a minute max?), please open a ticket or ping the upstream Tika user list.
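
One way to ride out such a restart on the client side is to retry the call with increasing waits. The sketch below is a generic wrapper, not something tika-python provides: `call_with_retry` and its parameters are hypothetical, and the default delays (5, 10, 20, 40 seconds) are an assumption chosen to cover the 30 seconds to a minute mentioned above.

```python
import time


def call_with_retry(fn, *args, retry_on=(ConnectionError,), attempts=5,
                    initial_delay=5.0, backoff=2.0, sleep_fn=time.sleep,
                    **kwargs):
    """Call fn(*args, **kwargs), retrying on the given exception types.

    Waits initial_delay seconds after the first failure and multiplies the
    wait by backoff after each further failure. Re-raises the last exception
    once all attempts are used up.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == attempts:
                raise
            sleep_fn(delay)
            delay *= backoff
```

Note that `requests.exceptions.ConnectionError` is not a subclass of Python's built-in `ConnectionError`, so with tika-python you would likely need to pass `retry_on=(requests.exceptions.ConnectionError,)` explicitly, for example `call_with_retry(parser.from_file, path, retry_on=(requests.exceptions.ConnectionError,))`.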

Be-Rahul commented 8 months ago

I'm using Fedora Linux with kernel version 6.6.13-200.fc39.x86_64 and Tika version 2.6.0, and I still get an error when parsing 156,110 files. Execution terminates with a connection error; it seems tika-server is not restarting in the given time.

requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))