Closed lathakris closed 6 years ago
Just an update: I see the messages below from the Tika server.
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C76 (1) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C105 (2) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C118 (3) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C110 (4) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C103 (5) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C82 (6) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C111 (7) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C109 (8) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C68 (9) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C66 (10) in font GAJFGL+Arial
2018-05-22 21:53:07,657 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C97 (11) in font GAJFGL+Arial
2018-05-22 21:53:07,659 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C115 (12) in font GAJFGL+Arial
2018-05-22 21:53:07,659 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C101 (13) in font GAJFGL+Arial
2018-05-22 21:53:07,659 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C116 (14) in font GAJFGL+Arial
2018-05-22 21:53:07,659 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C120 (15) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C72 (16) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C108 (17) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C100 (18) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C114 (19) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C99 (20) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C75 (21) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C104 (22) in font GAJFGL+Arial
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C65 (1) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C32 (2) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C115 (3) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,683 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C109 (4) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C111 (5) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C107 (6) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C101 (7) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C100 (8) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C116 (9) in font GAJFJF+ArialBOLD
2018-05-22 21:53:07,685 LathaDLP-NOX-18 user.notice tika: WARN No Unicode mapping for C99 (10) in font GA
Do you have Tesseract installed? If so, that may be slowing it down. Also, you may want to take this upstream to dev@tika.apache.org, since we just call Tika JAXRS/REST and don't introduce any overhead. See: #181
Thank you for your response, I will check with them. I do not have Tesseract installed.
Did you ever solve the problem? I am suffering from the same problem at the moment. I programmed a PDF-parsing tool using a local Tika Server, and it's slow as hell. When using it online over the Python Tika library, it's approximately 100x (not exaggerating).
If someone could provide a solution I would be really grateful!
Hi @Dreak1803, so you are saying it's 100x slower over Python Tika than (I assume) directly contacting a local Tika Server? Perhaps do a memory or CPU profile while each is running, and capture network stats. There shouldn't be any difference, since Tika Python is simply a REST client to Tika Server with lots of conveniences.
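One way to run the comparison suggested above is to time the same extraction through both paths. The following is only a minimal sketch, not a definitive benchmark: `time_calls`, `summarize` and `extract_via_rest` are hypothetical helper names, and the server address `http://localhost:9998` is an assumption. It relies on Tika Server accepting a `PUT` to `/tika` with an `Accept: text/plain` header.

```python
import statistics
import time
import urllib.request


def time_calls(fn, repeats=5):
    """Call fn() `repeats` times and return the per-call wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the min, median and max of a list of durations."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def extract_via_rest(path, server="http://localhost:9998"):
    """Send the file straight to Tika Server (assumed address) and return its text."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        server + "/tika",
        data=body,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a server running, `summarize(time_calls(lambda: extract_via_rest("sample.pdf")))` could be compared against the same call wrapping `parser.from_file("sample.pdf")` from tika-python (`sample.pdf` is a placeholder). If the two medians are close, the client is probably not where the time goes.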
Hi,
I am using tika-python to do text extraction. My setup is as follows: I run the Java Tika server in one Docker container and the Python client in another. The Python client has about 5 Celery tasks which submit files in parallel to the Tika server. I am using the 'unpack.from_file' interface of tika-python. (I used 'parser.from_file' initially, but it was very slow.) I see that the round-trip time is about 3 seconds (time captured before and after the call). I am passing the file name and the server address in the call. It starts off fine, at about 500 ms, and then starts to increase. Is there any specific reason this is slow? Am I missing anything? I have pinned Tika to one CPU core.
Thanks in advance for your response.
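To see whether the growing round-trip time in a setup like this tracks the number of parallel submitters, the per-call latency can be recorded at different concurrency levels. This is a rough sketch under assumptions, not a diagnosis: `timed`, `measure_parallel` and `drift` are hypothetical helpers, and threads stand in for the Celery tasks.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(fn, item):
    """Run fn(item) and return (item, seconds taken by that single call)."""
    start = time.perf_counter()
    fn(item)
    return item, time.perf_counter() - start


def measure_parallel(fn, items, workers=5):
    """Run fn(item) for every item on `workers` threads.

    Returns (item, seconds) pairs in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: timed(fn, item), items))


def drift(latencies, window=3):
    """Ratio of the mean of the last `window` latencies to the mean of the first."""
    first = latencies[:window]
    last = latencies[-window:]
    return (sum(last) / len(last)) / (sum(first) / len(first))
```

For example, `measure_parallel(lambda p: unpack.from_file(p, serverEndpoint=server), paths, workers=1)` could be compared with `workers=5`, where `server` and `paths` are placeholders for the server address and file list. If latency stays near 500 ms with one worker and climbs with five, requests may simply be queuing on the single core given to Tika.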