blenzi opened this issue 2 years ago (status: Open)
It seems you have some overhead somewhere else since the total elapsed time is way less than 4s. What is your pipeline and how do you call Parsr's API?
To time the full operation, I am using the Python client in a Jupyter notebook:
```python
%%timeit -n1 -r1
# Upload the document; the request id is stored on the client.
parsr.send_document(
    file_path=pdf_file,
    config_path='/tmp/parsr_config.json',
    document_name='Test',
    save_request_id=True)
# Poll until the 'progress-percentage' key disappears, i.e. processing is done.
while 'progress-percentage' in parsr.get_status()['server_response']:
    time.sleep(0.1)
```
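For a finer breakdown than a single %%timeit figure, the same two steps can be timed separately, which shows how much of the total goes into the upload call versus waiting for the server to finish. This is only a sketch, reusing the same `parsr` client, `pdf_file`, and config path as above (note that the 0.1 s polling interval adds up to roughly 0.1 s of quantization in the second number):

```python
import time

t0 = time.perf_counter()
# Upload the document; the request id is stored on the client.
parsr.send_document(
    file_path=pdf_file,
    config_path='/tmp/parsr_config.json',
    document_name='Test',
    save_request_id=True)
t1 = time.perf_counter()

# Poll until processing finishes (the progress key disappears).
while 'progress-percentage' in parsr.get_status()['server_response']:
    time.sleep(0.1)
t2 = time.perf_counter()

print(f'upload/send: {t1 - t0:.2f} s, server processing: {t2 - t1:.2f} s')
```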
The client is instantiated with

```python
from parsr_client import ParsrClient

parsr = ParsrClient('localhost:3001')
```
and the API is running via its Docker image.
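To check whether the overhead comes from the Python client or from the Parsr server itself, one could also time the raw HTTP round trip and skip the client entirely. The following is only a sketch under assumptions: it assumes the endpoints described in the Parsr API docs (`POST /api/v1/document` for the upload, `GET /api/v1/queue/<id>` for polling), multipart field names `file` and `config`, a plain-text queue id in the upload response, and the same `pdf_file` and config path as above.

```python
import time
import requests

base = 'http://localhost:3001/api/v1'

t0 = time.perf_counter()

# Upload the PDF together with the config; assume the response body is the queue id.
with open(pdf_file, 'rb') as f, open('/tmp/parsr_config.json', 'rb') as cfg:
    r = requests.post(f'{base}/document', files={'file': f, 'config': cfg})
    r.raise_for_status()
    queue_id = r.text.strip()

t1 = time.perf_counter()

# Poll the queue until the 'progress-percentage' key disappears.
while 'progress-percentage' in requests.get(f'{base}/queue/{queue_id}').json():
    time.sleep(0.1)

t2 = time.perf_counter()
print(f'upload: {t1 - t0:.2f} s, server processing: {t2 - t1:.2f} s')
```

If these numbers are close to the client-based measurement, the time is spent in the pipeline itself rather than in the Python client.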
Following the discussion in https://github.com/axa-group/Parsr/issues/510, I am also testing Parsr on small files and without (or with only a few) modules. With the README.pdf provided in the samples (8 pages), the first stage alone takes 1.5 s, and the total time to invoke the API and retrieve the "done" status is over 4 s. As a comparison, PyMuPDF takes about 40 ms. For a 40-page document the numbers are 10 s vs 200 ms. Any idea how to speed it up? The config is below: