chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0
1.5k stars 234 forks

text extraction is slow #182

Closed lathakris closed 6 years ago

lathakris commented 6 years ago

Hi,

I am using tika-python for text extraction. My setup is as follows: I run the Java Tika server in one Docker container and the Python client in another. The Python client has about 5 Celery tasks that submit files to the Tika server in parallel. I am using the 'unpack.from_file' interface of tika-python (I used 'parser.from_file' initially, but it was very slow), and I pass the file name and the server address in the call. The round-trip time, captured immediately before and after the call, is about 3 seconds. It starts off fine at about 500 ms and then increases. Is there a specific reason this is slow? Am I missing anything? I have dedicated one CPU core to Tika via CPU affinity.
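For anyone trying to reproduce this, a minimal sketch of the "time captured before and after the call" measurement might look like the following. The helper names `time_calls` and `summarize` are made up for illustration, and the commented usage assumes that `unpack.from_file` accepts a `serverEndpoint` keyword and that the server is reachable at a placeholder address; check both against your installed tika-python version.

```python
import statistics
import time


def time_calls(func, args_list, clock=time.perf_counter):
    """Call func once per argument tuple and return per-call durations in seconds."""
    durations = []
    for args in args_list:
        start = clock()
        func(*args)
        durations.append(clock() - start)
    return durations


def summarize(durations):
    """Return min, median and max so a gradual slowdown is easy to spot."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Hypothetical usage with tika-python (file names and server address are placeholders):
#
#   from tika import unpack
#   server = "http://tika:9998"
#   files = [("a.pdf",), ("b.pdf",)]
#   durations = time_calls(
#       lambda path: unpack.from_file(path, serverEndpoint=server), files
#   )
#   print(summarize(durations))
```

Logging the individual durations in submission order, rather than only the summary, would show whether latency grows steadily as the parallel tasks queue up behind the single core.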

Thanks in advance for your response.

lathakris commented 6 years ago

Just an update: I see the messages below from the Tika server.

2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C76 (1) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C105 (2) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C118 (3) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C110 (4) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C103 (5) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C82 (6) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C111 (7) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C109 (8) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C68 (9) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C66 (10) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C97 (11) in font GAJFGL+Arial
2018-05-22 21:53:07,659 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C115 (12) in font GAJFGL+Arial
2018-05-22 21:53:07,659 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C101 (13) in font GAJFGL+Arial
2018-05-22 21:53:07,659 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C116 (14) in font GAJFGL+Arial
2018-05-22 21:53:07,659 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C120 (15) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C72 (16) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C108 (17) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C100 (18) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C114 (19) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C99 (20) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C75 (21) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C104 (22) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C65 (1) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C32 (2) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C115 (3) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C109 (4) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C111 (5) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C107 (6) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C101 (7) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C100 (8) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C116 (9) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C99 (10) in font GA

chrismattmann commented 6 years ago

Do you have Tesseract installed? If so, that may be slowing it down. Further, you may want to take this upstream to dev@tika.apache.org, since we just call Tika JAX-RS/REST and don't introduce any overhead. See: #181

lathakris commented 6 years ago

Thank you for your response, I will check with them. I do not have Tesseract installed.

ghost commented 4 years ago

Did you ever solve the problem? I am suffering from the same problem at the moment. I programmed a PDF-parsing tool using a local Tika Server, and it's slow as hell. When using it online over the Python Tika library, the difference is approximately 100x (not exaggerating).

If someone could provide a solution, I would be really grateful!

chrismattmann commented 4 years ago

Hi @Dreak1803, so you are saying it's 100x slower over Python Tika than (I assume) contacting a local Tika Server directly? Perhaps run a memory or CPU profile while both are running, and capture network stats. There shouldn't be any difference, since Tika Python is simply a REST client to Tika Server with lots of conveniences.
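One way to separate client overhead from server time is to time a direct request to Tika Server and compare it with the same file sent through Tika Python. The sketch below uses only the standard library and assumes a server listening on `http://localhost:9998` with the usual `PUT /tika` text-extraction endpoint; the function names `build_tika_request` and `timed_extract` are hypothetical helpers, not part of tika-python.

```python
import time
import urllib.request


def build_tika_request(path, server="http://localhost:9998"):
    """Build a PUT request that sends the file body to the server's /tika endpoint."""
    with open(path, "rb") as handle:
        body = handle.read()
    return urllib.request.Request(
        server.rstrip("/") + "/tika",
        data=body,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain extracted text
    )


def timed_extract(path, server="http://localhost:9998", timeout=60):
    """Send the file to the server and return (text, elapsed_seconds)."""
    request = build_tika_request(path, server)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        text = response.read().decode("utf-8", errors="replace")
    return text, time.perf_counter() - start
```

If the direct call and the Tika Python call take about the same time for the same file, the bottleneck is on the server side (parsing, fonts, OCR) rather than in the client.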