Closed user06039 closed 1 year ago
This shouldn't be an issue? It may have to do with the Tika server you are using and memory limitations there. Please check. Thanks.
I have the same issue.
It seems that 100k files is the default number Tika server processes before restarting itself to prevent memory leaks.
https://tika.apache.org/2.2.1/api/org/apache/tika/server/core/ServerStatus.STATUS.html#HIT_MAX_FILES
https://cwiki.apache.org/confluence/display/TIKA/TikaServer+in+Tika+2.x (search for
Unfortunately, I'm struggling to modify this config option with tika-python. Any help would be appreciated.
Thanks @EvgeniyPaskin. I think you will probably have to keep track of the file count in an upstream Python program, then kill the backend server, wait a few seconds, and call Tika Python again, which should restart the server. That should do it. cc @tballison
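A minimal sketch of that "count, kill, wait, continue" idea, in case it helps. The function name `parse_in_batches` and its parameters are made up for illustration and are not part of tika-python; `parse_fn` and `restart_fn` are passed in so you can plug in whatever parse and kill calls work in your setup (for example `parser.from_file` and the `killServer()` call mentioned in this thread). The batch size and pause are guesses, not recommended values.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def parse_in_batches(
    paths: Iterable[str],
    parse_fn: Callable[[str], dict],
    restart_fn: Callable[[], None],
    batch_size: int = 50_000,      # assumption: stay well below the 100k limit
    pause_seconds: float = 5.0,    # assumption: "wait a few seconds"
) -> Iterator[Tuple[str, dict]]:
    """Yield (path, parsed) pairs, restarting the backend every batch_size files."""
    handled = 0
    for path in paths:
        # Before starting each new batch (but not the very first one),
        # stop the backend and give it a moment to shut down.
        if handled and handled % batch_size == 0:
            restart_fn()
            time.sleep(pause_seconds)
        yield path, parse_fn(path)
        handled += 1
```

Usage would look something like `for path, parsed in parse_in_batches(pdf_paths, parser.from_file, my_kill_function): ...`, where `my_kill_function` is your own wrapper around the kill call.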
@chrismattmann thanks for the prompt and useful response.
I've tried that, but unfortunately, for some reason, tika.killServer() didn't work within my loop.
I ended up starting the Tika server separately from the CLI/terminal with an adjusted tika_config.xml file:
java -jar tika-server-standard-2.6.0.jar -config my_tika_config.xml
and then letting your fabulous and awesome library do all the parsing logic. Thank you and your (+Tika) team for the great work!
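For anyone trying the same setup, here is a rough sketch of pointing tika-python at a server you started yourself. It assumes the server listens on the default port 9998 on localhost, and it relies on the `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` environment variables described in the tika-python README. The helper names `use_external_tika` and `parse_file` are hypothetical.

```python
import os


def use_external_tika(endpoint: str = "http://localhost:9998") -> None:
    """Tell tika-python to act as a REST client only and not spawn its own server.

    Must be called before `tika` is imported, because the library reads
    these environment variables at import time.
    """
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint  # assumption: default host/port


def parse_file(path: str) -> dict:
    """Parse one file through the externally managed Tika server."""
    # Imported lazily so the environment variables above are already set.
    from tika import parser

    return parser.from_file(path)
```

Call `use_external_tika()` once at the top of your script, then use `parse_file(...)` (or `parser.from_file` directly) as usual.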
thank you so much @EvgeniyPaskin !
With Tika 2.x, tika-server will stop itself and restart on a crash, on a timeout, or when it hits 100k docs, to avoid memory leaks (as you found). The tika-python client should be flexible enough to wait while tika-server is down and until it has restarted. If you're processing enough files, tika-server will go down and have to restart itself. Bumping the 100k to something larger only delays your run-in with parsers gone bad.
If the tika-server is going down and not restarting (30 secs to a minute max?), please open a ticket or ping the upstream Tika user list.
Using Fedora Linux with kernel version 6.6.13-200.fc39.x86_64 and tika version 2.6.0, I'm still getting an error when parsing 156,110 files. Execution terminates with a connection error; it seems tika-server is not restarting in the given time.
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
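One way to ride out an error like this on the client side is to retry the failing call with a growing delay, so the server has time to come back up. The sketch below is a generic helper, not something tika-python provides: `call_with_retry` and all of its parameters are hypothetical, and the attempt count and delays are guesses. By default it retries on `OSError`, which the `requests` exceptions derive from; in real use you would probably narrow that to `requests.exceptions.ConnectionError`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    attempts: int = 5,             # assumption: tune to your restart time
    initial_delay: float = 2.0,    # seconds before the first retry
    backoff: float = 2.0,          # delay multiplier after each failure
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff on the given exceptions."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
    raise ValueError("attempts must be at least 1")
```

With tika-python this might be used as `call_with_retry(lambda: parser.from_file(path), retry_on=(requests.exceptions.ConnectionError,))`. With the defaults it waits 2, 4, 8, and 16 seconds between the five attempts, which roughly covers the 30 seconds mentioned above.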
I'm using the Apache Tika Python client to parse PDF files, but in my case I have more than a million documents. I think Tika has some limitation where, after parsing about 100k files, it starts failing to parse new PDFs when we do,
Is this a common issue? How can I handle it? Is it possible to restart Tika directly from my Python code and make it work? Please help me.