chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

use of tika parser with multiprocessing #210

Closed by q0j0p 4 years ago

q0j0p commented 5 years ago

I was wondering whether Tika can be used with multiprocessing (in my case, to scale up PDF text extraction). Would this involve starting multiple JVMs explicitly? I'd be interested in adding this functionality given some guidance. Thanks.

q0j0p commented 5 years ago

I have 6 workers in my multiprocessing pool:

(env1) ubuntu@ip-172-31-31-94:~$ sudo lsof -n -i :9998
COMMAND   PID   USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
java    25545 ubuntu   19u  IPv6 2011453      0t0  TCP 127.0.0.1:9998 (LISTEN)
java    25545 ubuntu   40u  IPv6 2043175      0t0  TCP 127.0.0.1:9998->127.0.0.1:53224 (ESTABLISHED)
java    25545 ubuntu   63u  IPv6 2043156      0t0  TCP 127.0.0.1:9998->127.0.0.1:53204 (ESTABLISHED)
java    25545 ubuntu  137u  IPv6 2043144      0t0  TCP 127.0.0.1:9998->127.0.0.1:53192 (ESTABLISHED)
java    25545 ubuntu  152u  IPv6 2043168      0t0  TCP 127.0.0.1:9998->127.0.0.1:53216 (ESTABLISHED)
java    25545 ubuntu  155u  IPv6 2043202      0t0  TCP 127.0.0.1:9998->127.0.0.1:53232 (ESTABLISHED)
java    25545 ubuntu  158u  IPv6 2043159      0t0  TCP 127.0.0.1:9998->127.0.0.1:53208 (ESTABLISHED)
python  27693 ubuntu   20u  IPv4 2043155      0t0  TCP 127.0.0.1:53204->127.0.0.1:9998 (ESTABLISHED)
python  27694 ubuntu   20u  IPv4 2043189      0t0  TCP 127.0.0.1:53232->127.0.0.1:9998 (ESTABLISHED)
python  27695 ubuntu   20u  IPv4 2043176      0t0  TCP 127.0.0.1:53224->127.0.0.1:9998 (ESTABLISHED)
python  27696 ubuntu   20u  IPv4 2043160      0t0  TCP 127.0.0.1:53208->127.0.0.1:9998 (ESTABLISHED)
python  27697 ubuntu   20u  IPv4 2043143      0t0  TCP 127.0.0.1:53192->127.0.0.1:9998 (ESTABLISHED)
python  27698 ubuntu   20u  IPv4 2043167      0t0  TCP 127.0.0.1:53216->127.0.0.1:9998 (ESTABLISHED)

It looks like each worker boots its own JVM (Tika server), but they need to have unique endpoints. I'll see if I can iterate an initialization routine for the workers.
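In the listing above, all six Python workers are connected to the one server on port 9998. A minimal sketch of spreading the work over distinct endpoints follows. It is an illustration, not a feature of tika-python: the endpoint URLs and the names `assign_endpoints`, `extract_text`, and `extract_all` are hypothetical, and it assumes a Tika server has already been started by hand on each listed port. It hands out endpoints per file, round-robin, instead of through a per-worker initialization routine, and passes each one to `parser.from_file` through its `serverEndpoint` argument.

```python
import itertools
from multiprocessing import Pool

# Hypothetical endpoints: assumes a Tika server is already listening on each port.
ENDPOINTS = ["http://localhost:9998", "http://localhost:9999"]


def assign_endpoints(paths, endpoints):
    """Pair each file path with an endpoint, cycling through the endpoints."""
    return list(zip(paths, itertools.cycle(endpoints)))


def extract_text(job):
    """Parse one file against the endpoint it was paired with."""
    path, endpoint = job
    # Imported inside the worker so the module loads without tika installed.
    from tika import parser
    parsed = parser.from_file(path, serverEndpoint=endpoint)
    return path, parsed.get("content")


def extract_all(paths, endpoints=ENDPOINTS, workers=6):
    """Extract text from every path using a pool of worker processes."""
    jobs = assign_endpoints(paths, endpoints)
    with Pool(processes=workers) as pool:
        return dict(pool.map(extract_text, jobs))
```

Calling `extract_all(["a.pdf", "b.pdf"])` should be done under an `if __name__ == "__main__":` guard, since `multiprocessing` re-imports the module in child processes on some platforms. Whether several servers actually help depends on how much concurrency a single Tika server already handles, which this thread does not settle.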

chrismattmann commented 5 years ago

This is great! If you get this working, please contribute it back. Each worker booting its own JVM is fine, up to a point. A common practice...

chrismattmann commented 4 years ago

If you get time for a PR, please contribute it back.

SDAravind commented 9 months ago

@chrismattmann - I'm currently looking for multiprocessing support. Is this implemented in the current release? If yes, how do I invoke it?