Closed · ash-perfect closed this 5 years ago
I later learned that multithreading isn't the best solution for me, so I tried multiprocessing. With multiprocessing, the Camelot code isn't even executed; the process just stops when it reaches the camelot.read_pdf line. No error is given.
import camelot
import threading
import time
from multiprocessing import Process
import timeit

start = 0
stop = 0

def tstart():
    global start
    start = timeit.default_timer()

def tend():
    global stop
    stop = timeit.default_timer()
    execution_time = stop - start
    print("Program Executed in " + str(round(execution_time, 4)) + " seconds")

def getTables(start, end):
    print("is it working")
    for i in range(10):
        print('once')
        tables = camelot.read_pdf('Stryker_SPD.pdf', pages=str(i + 1))
        time.sleep(2)
        print(i)

if __name__ == "__main__":
    p1 = Process(target=getTables, args=(10, 20,))
    #p2 = Process(target=getTables, args=(20, 30,))
    tstart()
    p1.start()
    #p2.start()
    p1.join()
    #p2.join()
    tend()
Note that I am using only one process to test it out. The output is:
is it working
once
Program Executed in 0.7326 seconds
If I remove the Camelot line it executes perfectly.
Please help with this.
I'm interested in making Camelot faster for large files...
I tried everything I could, but nothing worked. I ended up using bash to run the parsing in multiple processes.
@ash-perfect Let me try and reproduce this.
@anakin87 Using multiprocessing?
I managed to extract a 100-page PDF using Camelot on 4 processes, which yielded good results: 157 seconds with multiprocessing versus 374 seconds without.
Also, you are passing start and end to getTables but never using these parameters inside the function. Try setting the pages argument to str(start) + '-' + str(end)
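A minimal sketch of that fix, combining the page-range suggestion with one Process per range. The file name and rough page counts come from this thread; the `page_ranges` helper and the `os.path.exists` guard around the demo are my own illustrative additions, so the snippet only attempts extraction if the PDF is actually present.

```python
import os
from multiprocessing import Process

def page_ranges(n_pages, n_workers):
    # Split 1..n_pages into n_workers contiguous (start, end) ranges.
    step = -(-n_pages // n_workers)  # ceiling division
    return [(s, min(s + step - 1, n_pages))
            for s in range(1, n_pages + 1, step)]

def get_tables(path, start, end):
    import camelot  # imported in the worker so each process loads its own copy
    # pages accepts a range string such as "11-20"
    tables = camelot.read_pdf(path, pages=str(start) + '-' + str(end))
    print("pages %d-%d: %d tables" % (start, end, tables.n))

if __name__ == "__main__" and os.path.exists('Stryker_SPD.pdf'):
    workers = [Process(target=get_tables, args=('Stryker_SPD.pdf', s, e))
               for s, e in page_ranges(200, 4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```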
Closed in favor of https://github.com/camelot-dev/camelot/issues/20.
Hey everyone, I am also encountering an issue when processing many PDFs concurrently on a server. I've built an async Flask service that extracts tables for users, with Camelot as the core technique. For better performance, I created a thread pool to submit the extraction tasks. What I've found is that Camelot may not handle many PDFs well in a thread pool, presumably because of Ghostscript: when I throw about 3-5 files at a Docker container, some of the tasks just get stuck and others hit a GhostscriptError: -100.
However, when I set the thread pool's max_workers to 1, so tasks run in a single queue, the issue never occurs. The tradeoff is that I may have to deploy many instances to handle high concurrency.
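A sketch of that single-queue workaround, assuming the diagnosis above is right (concurrent Ghostscript calls inside one process are the problem): funnel every Camelot call through a `ThreadPoolExecutor` with `max_workers=1`, so extractions never overlap even if many request handlers submit jobs at once. The `extract_tables` and `handle_request` names are illustrative, not from the service described above.

```python
from concurrent.futures import ThreadPoolExecutor

# One worker = one queue: jobs are executed strictly one at a time,
# no matter how many threads submit them concurrently.
camelot_queue = ThreadPoolExecutor(max_workers=1)

def extract_tables(path):
    import camelot  # lazy import; camelot must be installed in the service
    return camelot.read_pdf(path, pages='all')

def handle_request(path):
    # Each handler blocks only on its own result; the single-worker
    # executor guarantees at most one Ghostscript call at a time.
    future = camelot_queue.submit(extract_tables, path)
    return future.result()
```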
@vinayak-mehta
@maximeboun can you share a code snippet for multiprocessing?
@mlbrothers take a look at this for inspiration: https://camelot-py.readthedocs.io/en/master/user/faq.html#how-to-reduce-memory-usage-for-long-pdfs
I am not using multiprocessing but dividing the extraction into chunks. Maybe it can be a starting point...
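A sketch of that chunked approach, in the spirit of the FAQ entry linked above: read a long PDF a few pages at a time, so only one chunk's worth of parsing state is held in memory at once. The chunk size and the `chunk_ranges` helper are my own illustrative choices, not Camelot API.

```python
def chunk_ranges(n_pages, chunk=10):
    # (start, end) page ranges covering 1..n_pages in steps of `chunk`.
    return [(s, min(s + chunk - 1, n_pages))
            for s in range(1, n_pages + 1, chunk)]

def extract_in_chunks(path, n_pages, chunk=10):
    import camelot  # lazy import so the helper above stays importable
    results = []
    for start, end in chunk_ranges(n_pages, chunk):
        tables = camelot.read_pdf(path, pages='%d-%d' % (start, end))
        results.extend(tables)  # keep only what you need from each chunk
    return results
```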
As the title says, I have 200 pages and it takes around 4 mins to extract the tables from all the pages. So I decided to use multiple threads to extract faster.
I am using Jupyter Notebook and all the code below is in a single cell.
Here is my code:
When I run this, the kernel in my Jupyter Notebook either dies completely, or an exception occurs in one of the threads while the other thread runs properly.
The exception goes like this:
Please help me with this. Am I missing something?