Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.64k stars 705 forks

Unstructured data extraction using the hi_res method takes 1 hour to extract data from a 700-page PDF. #3217

uprg closed 3 months ago

uprg commented 3 months ago

Python version

3.8.10

GCC version

9.4.0

unstructured version

0.11.8

unstructured-client version

0.23.3

unstructured-inference version

0.7.18

unstructured.pytesseract version

0.3.12

onnxruntime python lib version

1.15.1

OS version

Ubuntu 20.04.6 LTS

Specs

CPU: 11th Gen Intel® Core™ i5-1135G7 @ 2.40GHz × 8
RAM: 23.3 GiB
GPU: No GPU
Integrated Graphics: Mesa Intel® Xe Graphics (TGL GT2)

Issue

Partitioning a 700-page PDF with the hi_res method takes a little over 1 hour.

Expected behavior

Partitioning should take less than 1 hour.

To reproduce

Install the unstructured Python SDK with (python3.8 -m pip install "unstructured[all-docs]") and use the code below:

from unstructured.partition.auto import partition
from unstructured.partition.strategies import PartitionStrategy

result = partition(
    filename="./700-Mega.pdf",
    content_type="application/pdf",
    strategy=PartitionStrategy.HI_RES,
    hi_res_model_name="detectron2_onnx",
    include_page_breaks=True,
    encoding="utf-8",
    skip_infer_table_types=[]
)

Hi, above is the code I used to extract data from a 700-page PDF. It took one hour, and sometimes a little over one hour. What steps should I take to lower the time? Or is there something wrong with my code or with the way I am using unstructured?

The whole code runs on CPU; there is no GPU. Should I run it on a GPU? If yes, how can I do that? Or can I make some optimizations on the CPU?

Thank you for your response

MthwRobinson commented 3 months ago

Hi @uprg, a long runtime is expected on a PDF of that length. If you need to process it faster, try using our API through our client SDK. That will split up the PDF and run partition across multiple workers.

uprg commented 3 months ago

Hi @MthwRobinson, thank you for your reply. I just wanted to know whether I can run the partition on multiple workers with the Python SDK rather than the client SDK.

MthwRobinson commented 3 months ago

It's not built into the unstructured library, but if you want to split up a PDF and run in parallel you could do something similar to what we've done in the client library code. The PDF splitting module is here if you'd like to take a look.
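For readers who want to try the split-and-parallelize idea locally, here is a minimal sketch. It is not the client library's implementation. It assumes pypdf is installed for the splitting step, and the helper names (page_ranges, partition_range, partition_parallel) and the default chunk size and worker count are made up for illustration. Whether it helps on a laptop CPU is uncertain, since the layout model may already use several threads and each worker loads its own copy of the model.

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor
from typing import List, Tuple


def page_ranges(num_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size - 1, num_pages))
        for start in range(1, num_pages + 1, chunk_size)
    ]


def partition_range(task: Tuple[str, int, int]) -> List[dict]:
    """Worker: copy pages first..last into a temporary PDF and partition it."""
    path, first, last = task
    # Imported here so that page_ranges() stays usable without these packages.
    from pypdf import PdfReader, PdfWriter
    from unstructured.partition.pdf import partition_pdf

    reader = PdfReader(path)
    writer = PdfWriter()
    for index in range(first - 1, last):  # pypdf pages are 0-based
        writer.add_page(reader.pages[index])

    with tempfile.TemporaryDirectory() as tmpdir:
        chunk_path = os.path.join(tmpdir, "chunk.pdf")
        with open(chunk_path, "wb") as handle:
            writer.write(handle)
        elements = partition_pdf(filename=chunk_path, strategy="hi_res")

    # Return plain dicts (cheap to send between processes) and shift the
    # chunk-relative page numbers back to positions in the original PDF.
    dicts = [element.to_dict() for element in elements]
    for item in dicts:
        page = item.get("metadata", {}).get("page_number")
        if page is not None:
            item["metadata"]["page_number"] = page + first - 1
    return dicts


def partition_parallel(path: str, chunk_size: int = 50, workers: int = 4) -> List[dict]:
    """Partition a PDF in page chunks across worker processes, in page order."""
    from pypdf import PdfReader

    num_pages = len(PdfReader(path).pages)
    tasks = [(path, first, last) for first, last in page_ranges(num_pages, chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() yields results in task order, so the chunks stay in page order.
        return [item for chunk in pool.map(partition_range, tasks) for item in chunk]


# Hypothetical usage (call it under `if __name__ == "__main__":` when running as a script):
#     elements = partition_parallel("./700-Mega.pdf", chunk_size=50, workers=4)
```

One trade-off to keep in mind: elements that span a chunk boundary, such as a table continuing onto the next page, are partitioned separately, so results near the boundaries may differ from a single-pass run.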