chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

How to solve the 422 error code when parsing a .doc file that contains many pictures? #272

Closed by sunweiconfidence 4 years ago

sunweiconfidence commented 4 years ago

Hi @chrismattmann,

When I parse a 150 MB .doc file, about 140 MB of which is pictures, I get a 422 status code every time I call the parser.from_buffer method. I also set TikaJavaArgs = os.getenv("TIKA_JAVA_ARGS", '-Xmx32g -Xmn20g') in tika.py to give the JVM more memory, but the 422 status code still comes back. How can I fix this? Thanks.
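For readers hitting the same problem, here is a minimal sketch of how the call and the status check might look. It assumes that tika.py reads TIKA_JAVA_ARGS when the module is imported, so the variable is set first, and that it only matters when tika-python starts the server itself. The helper name parse_bytes and its parse parameter are hypothetical, added only so the status check can be tried without a running server. As far as the Tika Server documentation describes it, 422 means the server could not process the document, which points at the file rather than at JVM memory.

```python
import os

# Assumption: tika.py reads TIKA_JAVA_ARGS at import time, so set it before
# importing tika instead of editing tika.py. setdefault keeps any value that
# is already present in the environment.
os.environ.setdefault("TIKA_JAVA_ARGS", "-Xmx32g -Xmn20g")


def parse_bytes(data, parse=None):
    """Parse raw bytes with Tika and raise if the status is not 200.

    `parse` is a hypothetical hook for testing; by default the real
    tika.parser.from_buffer is used.
    """
    if parse is None:
        # Imported lazily so the environment variable above is already set.
        from tika import parser
        parse = parser.from_buffer

    parsed = parse(data)
    # The result is a dict; "status" may be missing if the server sent nothing.
    status = parsed.get("status")
    if status != 200:
        raise RuntimeError(f"Tika returned HTTP status {status}")
    return parsed
```

Calling parse_bytes(open("big.doc", "rb").read()) on the problem file would then raise with the 422 status instead of silently returning empty content (big.doc is a placeholder file name).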

sunweiconfidence commented 4 years ago

@chrismattmann I have solved this by reading the exception part of the tika-server.jar source code. I found that when I save the .doc file in .docx format, the file can be parsed easily.
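The comment above does not say how the file was re-saved. One way to script the same workaround, as a sketch only, is to let LibreOffice do the conversion in headless mode and then hand the .docx to Tika. This assumes LibreOffice is installed and its soffice binary is on the PATH; the helper names build_convert_command and convert_to_docx are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir):
    """Build a LibreOffice command line that converts a file to .docx."""
    return [
        "soffice", "--headless",     # run without opening a window
        "--convert-to", "docx",      # target format
        "--outdir", str(out_dir),    # directory for the converted file
        str(doc_path),
    ]


def convert_to_docx(doc_path, out_dir):
    """Run the conversion and return the expected path of the .docx file.

    Assumption: LibreOffice keeps the input file's base name and only
    changes the extension.
    """
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

The returned path can then be passed to parser.from_file, for example parser.from_file(str(convert_to_docx("big.doc", "converted"))), where big.doc and converted are placeholder names.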

chrismattmann commented 4 years ago

Thanks for this, @sunweiconfidence!