chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Setting heap space for tika #331

Closed: sany2k8 closed this issue 3 years ago

sany2k8 commented 3 years ago

I am parsing 200 MB to 500 MB PDF files with the tika-python jar and it works, but when I try a 1.3 GB file the Tika server is not able to handle it. Looking through tika-server.log, I found this error:

java.lang.OutOfMemoryError: Java heap space.

So my question is: how do I set the heap/memory size when running Tika, or is there anything I need to set from Python code to increase it?

What about this environment variable? Will it fix the issue so that I can parse large PDFs?

TIKA_JAVA_ARGS - set java runtime arguments, e.g, -Xmx4g

I've tried this configuration, but no luck:

import os

import tika

# Update the required variables
# (abs_path is defined elsewhere in the application)
tika.TikaServerLogFilePath = os.getenv('TIKA_LOG_PATH', abs_path + '/logs')
tika.TikaJarPath = os.getenv('TIKA_PATH', abs_path)
tika.TikaJavaArgs = os.getenv('TIKA_JAVA_ARGS', '-Xmx8g')

gooseillo commented 3 years ago

I was able to increase the heap space with the setting below.

You need to make sure the server is not already running. If it is, you can kill it from the Task Manager.

import tika
tika.tika.TikaJavaArgs = '-Xmx16g'

If you run the parser now, it will use the argument above. -Xmx16g sets the maximum heap size to 16 GB; you can raise that number if you need more heap space.
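
For reference, here is a minimal sketch combining the two routes mentioned in this thread: exporting TIKA_JAVA_ARGS and assigning tika.tika.TikaJavaArgs. The helper name set_tika_heap is hypothetical, not part of tika-python. The sketch assumes the library picks up TIKA_JAVA_ARGS when tika.tika is first imported, so it should run before that import and before any Tika server has been started.

```python
import os


def set_tika_heap(max_heap="4g"):
    """Hypothetical helper: request a maximum JVM heap for the Tika server.

    Sets the TIKA_JAVA_ARGS environment variable quoted earlier in this
    thread and returns the argument string that was set.
    """
    java_args = "-Xmx" + max_heap
    os.environ["TIKA_JAVA_ARGS"] = java_args
    return java_args


java_args = set_tika_heap("8g")

try:
    # Import only after the environment variable is in place.
    import tika.tika

    # Also set the module attribute, as suggested above, in case
    # tika.tika had already been imported somewhere else.
    tika.tika.TikaJavaArgs = java_args
except ImportError:
    # tika is not installed here; the environment variable is still set.
    pass
```

If a Tika server from an earlier run is still alive, it keeps its old heap setting, so it needs to be stopped first as noted above.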

Make sure you are running 64-bit Java; otherwise, a 32-bit JVM is typically limited to less than 2 GB of heap.
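
One rough way to check this from Python is to look at the banner printed by `java -version`. This is only a sketch: the function names are made up for illustration, and it assumes a HotSpot/OpenJDK-style banner that contains the text "64-Bit" on 64-bit builds. Other JVMs may word it differently.

```python
import shutil
import subprocess


def is_64bit_jvm(version_text):
    """Return True if the `java -version` banner mentions a 64-bit VM."""
    return "64-bit" in version_text.lower()


def java_version_text(java="java"):
    """Return the text printed by `java -version`, or None if java is not on PATH."""
    if shutil.which(java) is None:
        return None
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    # The JVM normally prints its version banner to stderr.
    return result.stderr + result.stdout


text = java_version_text()
if text is None:
    print("java not found on PATH")
elif is_64bit_jvm(text):
    print("64-bit JVM detected")
else:
    print("no 64-bit marker found in the java -version output")
```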

chrismattmann commented 3 years ago

Thank you, @gooseillo!