atlanhq / camelot

Camelot: PDF Table Extraction for Humans
3.62k stars 350 forks

Using multithreading to extract tables from a large PDF #347

Closed ash-perfect closed 5 years ago

ash-perfect commented 5 years ago

As the title says, I have a 200-page PDF and it takes around 4 minutes to extract the tables from all the pages, so I decided to use multiple threads to speed up the extraction.

I am using Jupyter Notebook and all the code below is in a single cell.

Here is my code:

import camelot
import threading
import time

def getTables(start,end):
    for i in range(start,end):
        tables = camelot.read_pdf('Stryker_SPD.pdf',pages=str(i+1))

if __name__ == "__main__":
    t1 = threading.Thread(target=getTables, name='t1', args=(10, 20))
    t2 = threading.Thread(target=getTables, name='t2', args=(30, 40))
    # Start both threads, then wait for them to finish
    t1.start()
    t2.start()
    t1.join()
    t2.join()


When I run this, the kernel in my Jupyter Notebook either dies completely, or one of the threads raises an exception while the other runs properly.

The exception looks like this:

Exception in thread t2:
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/", line 916, in _bootstrap_inner
  File "/anaconda3/lib/python3.6/", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "<ipython-input-1-6180219f4c48>", line 18, in getTables
    tables = camelot.read_pdf('Stryker_SPD.pdf',pages=str(i+1))
  File "/anaconda3/lib/python3.6/site-packages/camelot/", line 106, in read_pdf
    layout_kwargs=layout_kwargs, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/camelot/", line 162, in parse
  File "/anaconda3/lib/python3.6/site-packages/camelot/parsers/", line 351, in extract_tables
  File "/anaconda3/lib/python3.6/site-packages/camelot/parsers/", line 193, in _generate_image
    with Ghostscript(*gs_call, stdout=null) as gs:
  File "/anaconda3/lib/python3.6/site-packages/camelot/ext/ghostscript/", line 89, in Ghostscript
    __instance__ = gs.new_instance()
  File "/anaconda3/lib/python3.6/site-packages/camelot/ext/ghostscript/", line 71, in new_instance
    raise GhostscriptError(rc)
camelot.ext.ghostscript._gsprint.GhostscriptError: -100

Please help me with this. Am I missing something?

ash-perfect commented 5 years ago

I later learned that multithreading isn't the best solution for me, so I tried multiprocessing. With that, the Camelot code isn't executed at all: the process just stops when it reaches the camelot.read_pdf line, and no error is raised.

import camelot
import threading
import time

from multiprocessing import Process

import timeit
def tstart():
    global start
    start = timeit.default_timer()
def tend():
    global stop
    stop = timeit.default_timer()
    execution_time = stop - start
    print("Program Executed in "+str(round(execution_time,4))," seconds")

def getTables(start,end):
    print("is it working")
    for i in range(10):
        tables = camelot.read_pdf('Stryker_SPD.pdf',pages=str(i+1))

if __name__ == "__main__":
    tstart()
    p1 = Process(target=getTables,args=(10,20,))
    #p2 = Process(target=getTables,args=(20,30,))
    # Run the single test process and time it
    p1.start()
    p1.join()
    tend()


Note that I am using only one process to test it. The output is:

is it working
Program Executed in 0.7326  seconds

If I remove the Camelot line it executes perfectly.

Please help with this.

anakin87 commented 5 years ago

I'm interested in making Camelot faster for large files...

ash-perfect commented 5 years ago

I tried as much as I could, but nothing worked. I ended up using bash to run the parsing in multiple processes.

vinayak-mehta commented 5 years ago

@ash-perfect Let me try and reproduce this.

> I'm interested in making Camelot faster for large files...

@anakin87 Using multiprocessing?

maximeboun commented 5 years ago

I managed to extract tables from a 100-page PDF using Camelot with 4 processes, which gave good results: 157 seconds with multiprocessing versus 374 seconds without. Also, you are passing start and end to getTables, but you're not using these parameters in the function. Try setting the pages argument to str(start) + '-' + str(end).
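For anyone looking for a starting point, the suggestion above (one page range per process, passed as a 'start-end' string) could be sketched roughly as follows. This is not the code that produced the timings quoted above, which was not posted. The helper names page_ranges, extract and extract_parallel are made up for illustration, and the sketch assumes Camelot is installed and that returning each table's .df (a pandas DataFrame) is enough for your use case.

```python
from multiprocessing import Pool


def page_ranges(first, last, n_chunks):
    """Split the inclusive page range first..last into up to n_chunks 'a-b' strings."""
    total = last - first + 1
    size, extra = divmod(total, n_chunks)
    ranges, start = [], first
    for i in range(n_chunks):
        # The first `extra` chunks get one additional page each
        length = size + (1 if i < extra else 0)
        if length == 0:
            continue
        end = start + length - 1
        ranges.append(f"{start}-{end}")
        start = end + 1
    return ranges


def extract(job):
    """Worker: runs in a separate process and parses one page range."""
    import camelot  # imported lazily inside the worker; assumes Camelot is installed
    path, pages = job
    tables = camelot.read_pdf(path, pages=pages)
    # Return plain DataFrames, which can be pickled back to the parent process
    return [t.df for t in tables]


def extract_parallel(path, first, last, workers=4):
    """Parse pages first..last of `path` using one page range per worker process."""
    jobs = [(path, r) for r in page_ranges(first, last, workers)]
    with Pool(processes=workers) as pool:
        chunks = pool.map(extract, jobs)
    # Flatten the per-range lists into one list, in page order
    return [df for chunk in chunks for df in chunk]
```

To use it, call something like extract_parallel('Stryker_SPD.pdf', 1, 200, workers=4) from under an if __name__ == "__main__": guard in a script. Worker functions defined in a notebook cell may not be importable by child processes on every platform, so running this from a .py file is probably the safer choice.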

vinayak-mehta commented 5 years ago

Closed in favor of

LinanYaooo commented 3 years ago

Hey everyone, I am also running into an issue when processing many PDFs concurrently on a server. I've built an async Flask service that extracts tables for users, with Camelot as the core technique. For better performance, I submit the extraction tasks to a thread pool. What I've found is that Camelot may not handle many PDFs well in a thread pool, possibly because of its use of Ghostscript. I send about 3-5 files to a Docker container; some of the tasks just get stuck and others hit the GhostscriptError: -100 issue.

However, when I set the thread pool's max_workers to 1, so that tasks run in a single queue, the issue never occurs. The tradeoff is that I may need to deploy many instances to handle high concurrency.
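A possible middle ground, sketched below, is to keep the thread pool for request handling but guard the extraction step with a lock, so that only one thread is inside Camelot at any time. This has the same effect on extraction as max_workers=1 and does not add throughput; it only lets the other threads keep doing non-Camelot work. The names _gs_lock, read_tables, extract_serialized and handle_uploads are hypothetical, and the assumption that serializing the calls avoids the error rests only on the max_workers=1 observation above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical module-level lock guarding every Camelot call in this process
_gs_lock = threading.Lock()


def read_tables(path, pages="1"):
    """Default reader: parse `pages` of `path` with Camelot and return DataFrames."""
    import camelot  # imported lazily; assumes Camelot is installed
    return [t.df for t in camelot.read_pdf(path, pages=pages)]


def extract_serialized(path, pages="1", reader=read_tables):
    """Run the reader while holding the lock, so extractions never overlap."""
    with _gs_lock:
        return reader(path, pages)


def handle_uploads(paths, reader=read_tables, max_workers=4):
    """Submit one task per file; threads run concurrently, extraction is serialized."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extract_serialized, p, "1", reader) for p in paths]
        # Collect results in submission order
        return [f.result() for f in futures]
```

If real parallel extraction is needed inside one container, running each extraction in a separate process (for example with concurrent.futures.ProcessPoolExecutor) may be worth trying instead, since the thread earlier reports multiprocessing working with 4 processes.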


mlbrothers commented 1 year ago

@maximeboun can you share a code snippet for multiprocessing?

anakin87 commented 1 year ago

@mlbrothers take a look at this for inspiration:

I am not using multiprocessing but dividing the extraction into chunks. Maybe it can be a starting point...
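The linked code is not reproduced in this thread, so the following is only a guess at what "dividing the extraction into chunks" could look like: the PDF is parsed sequentially, a fixed number of pages at a time, instead of in one read_pdf call over the whole file. The function names chunked_pages and extract_in_chunks and the default chunk size of 50 are assumptions, and so is the reader parameter, which exists only so the loop can be exercised without Camelot.

```python
def chunked_pages(n_pages, chunk_size):
    """Yield 'a-b' page strings covering pages 1..n_pages, chunk_size pages at a time."""
    for start in range(1, n_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, n_pages)
        yield f"{start}-{end}"


def extract_in_chunks(path, n_pages, chunk_size=50, reader=None):
    """Parse `path` chunk by chunk and return all tables as DataFrames."""
    if reader is None:
        import camelot  # imported lazily; assumes Camelot is installed
        reader = camelot.read_pdf
    frames = []
    for pages in chunked_pages(n_pages, chunk_size):
        # One read_pdf call per chunk, e.g. pages='1-50', then '51-100', ...
        tables = reader(path, pages=pages)
        frames.extend(t.df for t in tables)
    return frames
```

For a 200-page file, extract_in_chunks('Stryker_SPD.pdf', 200) would issue four read_pdf calls. Whether this is faster than a single call is not something this thread establishes; it mainly keeps each call small and makes it easy to save partial results between chunks.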