uprg closed this issue 3 months ago
Hi @uprg, a long runtime is expected on a PDF of that length. If you need faster processing, try using our API through our client SDK. It splits up the PDF and runs the partition across multiple workers.
Hi @MthwRobinson, thank you for your reply. I just wanted to know whether I can run the partition on multiple workers with the Python SDK rather than the client SDK.
It's not built into the unstructured library, but if you want to split up a PDF and process the pieces in parallel, you could do something similar to what we've done in the client library code. The PDF splitting module is here if you'd like to take a look.
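For illustration, here is a rough sketch of that idea using only the local library. It is not the client library's implementation: the helper names (page_ranges, partition_chunk, partition_parallel) and the chunk_size and workers defaults are made up for this example, and it assumes pypdf is installed alongside unstructured. Each worker writes its page range to a temporary PDF, partitions it with hi_res, and shifts the reported page numbers back to the original document.

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor


def page_ranges(num_pages, chunk_size):
    """Return (start, end) page-index pairs, end exclusive, covering the PDF."""
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def partition_chunk(job):
    """Partition pages [start, end) of the PDF at `path` (hypothetical helper)."""
    # Third-party imports live here so every worker process loads them itself.
    from pypdf import PdfReader, PdfWriter
    from unstructured.partition.pdf import partition_pdf

    path, start, end = job
    reader = PdfReader(path)
    writer = PdfWriter()
    for index in range(start, end):
        writer.add_page(reader.pages[index])

    # Write the page range to a temporary file that partition_pdf can open.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        writer.write(tmp)
        tmp_path = tmp.name
    try:
        elements = partition_pdf(filename=tmp_path, strategy="hi_res")
        dicts = [element.to_dict() for element in elements]
    finally:
        os.remove(tmp_path)

    # Page numbers are relative to the chunk; shift them to the original PDF.
    for item in dicts:
        metadata = item.get("metadata", {})
        if metadata.get("page_number") is not None:
            metadata["page_number"] += start
    return dicts


def partition_parallel(path, chunk_size=20, workers=4):
    """Split the PDF into page ranges and partition them in worker processes."""
    from pypdf import PdfReader

    num_pages = len(PdfReader(path).pages)
    jobs = [(path, start, end) for start, end in page_ranges(num_pages, chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() yields results in job order, so the elements stay in page order.
        chunks = list(pool.map(partition_chunk, jobs))
    return [item for chunk in chunks for item in chunk]
```

Call partition_parallel("your.pdf") from under an `if __name__ == "__main__":` guard so the worker processes can start cleanly. The speedup on a CPU-only machine is likely to be less than the worker count, since the inference libraries may already use several threads per process, so it is worth trying a few values of workers and chunk_size. Elements that span a chunk boundary may also come out differently than in a single-pass run.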
Python version
3.8.10
GCC version
9.4.0
unstructured version
0.11.8
unstructured-client version
0.23.3
unstructured-inference version
0.7.18
unstructured.pytesseract version
0.3.12
onnxruntime python lib version
1.15.1
OS version
Ubuntu 20.04.6 LTS
Specs
CPU: 11th Gen Intel® Core™ i5-1135G7 @ 2.40GHz × 8
RAM: 23.3 GiB
GPU: No GPU
Integrated Graphics: Mesa Intel® Xe Graphics (TGL GT2)
Issue
Partitioning a 700-page PDF with the hi_res strategy takes a little over 1 hour.
Expected behavior
Partitioning should take less than 1 hour.
To reproduce
Install the unstructured Python SDK with (python3.8 -m pip install "unstructured[all-docs]") and use the code below:
Hi, above is the code I used to extract data from a 700-page PDF, which took one hour or sometimes a little over one hour. What steps should I take to lower the time? Or is there something wrong with my code or the way I am using unstructured?
The whole code runs on CPU; there is no GPU. Should I run it on a GPU? If yes, how can I do that? Or are there optimizations I can make on CPU?
Thank you for your response