Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
9.27k stars 771 forks

CPU only installation #3326

Open arthurbrenno opened 5 months ago

arthurbrenno commented 5 months ago

I've been using unstructured for a while on a CPU-only machine. I've noticed a lot of NVIDIA files (2+ GB) in my venv folder coming from PyTorch (possibly one of unstructured's dependencies).

Can I install a CPU-only version of unstructured? I've been partitioning documents for a while and no GPU has ever been used.

Here is my requirements.in file:

uvicorn[standard]==0.25.0
fastapi==0.111.0
pyyaml==6.0.1
injector==0.21.0
overrides==7.7.0
langchain==0.2.5
langchain-google-genai==1.0.6
json-repair==0.9.0
unstructured[pptx,image,docx,pdf]==0.14.9
opencv-python-headless==4.9.0.80
jq==1.6.0
pytesseract==0.3.10
pymilvus==2.3.6
langchain-openai==0.1.8
scikit-learn==1.5.0
ruff==0.3.1
pandas==2.2.1
llama-index==0.10.33
python-multipart==0.0.9
llama-index-vector-stores-milvus==0.1.10
playwright==1.43.0
python-magic==0.4.27
llama-index-llms-gemini==0.1.11
opencv-python==4.9.0.80
llama-index-llms-anthropic==0.1.11
llama-index-llms-ollama==0.1.5
llama-index-embeddings-ollama==0.1.2
pymupdf==1.24.4
pypdf[image]==4.2.0
llama-index-multi-modal-llms-ollama==0.1.3
llama-index-llms-groq==0.1.4
gensim==3.6.0
firebase-admin==6.5.0
demjson3==3.0.6
langchain-community==0.2.5
jsonschema==4.22.0
pypdf2==3.0.1
fpdf==1.7.2
moviepy==1.0.3
neo4j==5.21.0
llama-index-graph-stores-neo4j==0.2.5
pylatex==1.4.2
reportlab==4.2.0
psutil==5.9.8
fastapi-utils==0.7.0
colorama==0.4.6
humanize==4.9.0
objgraph==3.6.1
imgkit==1.2.3
pyppeteer==2.0.0
wkhtmltopdf==0.2
llama-agents==0.0.3
click==8.1.7
mypy==1.10.1

Note that torch is not listed in it.
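To put a number on the footprint described above, a rough sketch like the following can sum the installed size of GPU-related packages in the active environment. It uses only the standard library (`importlib.metadata`). The prefix list (`nvidia`, `torch`, `triton`) is an assumption about which distributions are relevant, and the function names are hypothetical.

```python
from importlib import metadata
from pathlib import Path


def installed_size_bytes(dist: metadata.Distribution) -> int:
    """Sum the on-disk size of the files a distribution recorded at install time."""
    total = 0
    for f in dist.files or []:  # dist.files is None when no file record exists
        path = Path(dist.locate_file(f))
        if path.is_file():
            total += path.stat().st_size
    return total


def sizes_by_prefix(prefixes=("nvidia", "torch", "triton")) -> dict[str, int]:
    """Map matching distribution names to their installed size in bytes.

    The default prefixes are an assumption, not an official list.
    """
    prefixes = tuple(p.lower() for p in prefixes)
    sizes: dict[str, int] = {}
    for dist in metadata.distributions():
        name = (dist.metadata.get("Name") or "").lower()
        if name.startswith(prefixes):
            sizes[name] = installed_size_bytes(dist)
    return sizes


# Print the matches, largest first, in megabytes.
for name, size in sorted(sizes_by_prefix().items(), key=lambda kv: -kv[1]):
    print(f"{name:40s} {size / 1e6:10.1f} MB")
```

Run inside the venv in question; an empty result simply means no matching distributions are installed there.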

MthwRobinson commented 5 months ago

Thanks for the suggestion, @arthurbrenno. We'll take a look at this. I think this would have the side benefit of reducing the size of our CPU images.

arthurbrenno commented 4 months ago

Tysm! It would save us about 3 GB of storage.

belmmostest commented 4 months ago

@arthurbrenno see here #2976

sidatcd commented 4 months ago

Installing torch-cpu before the unstructured libs should help. That way the NVIDIA GPU libraries for PyTorch are not installed. This is what I have been doing to build Lambda images.
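The install order described above can be sketched roughly as follows, assuming "torch-cpu" means the CPU-only torch wheels from PyTorch's CPU wheel index and that pip is the installer. The helper names are hypothetical, and the default is a dry run that only prints the commands. Whether pip keeps the CPU build depends on it still satisfying the version constraints of whatever is installed afterwards.

```python
import shlex
import subprocess
import sys

# PyTorch's CPU-only wheel index.
CPU_INDEX = "https://download.pytorch.org/whl/cpu"


def cpu_first_commands(unstructured_spec: str) -> list[list[str]]:
    """Build the two pip invocations: CPU torch first, then unstructured."""
    pip = [sys.executable, "-m", "pip", "install"]
    return [
        pip + ["torch", "--index-url", CPU_INDEX],  # step 1: CPU-only torch
        pip + [unstructured_spec],                  # step 2: unstructured and extras
    ]


def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.check_call(cmd)


# Dry run using the unstructured pin from the requirements.in above.
run(cpu_first_commands("unstructured[pptx,image,docx,pdf]==0.14.9"))
```

Pass `dry_run=False` to actually install. If torch is already satisfied when the second command runs, pip should leave it in place rather than fetching the default CUDA-enabled wheel.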

arthurbrenno commented 4 months ago

Thank you, @sidatcd!

jaideep11061982 commented 1 month ago

@sidatcd I need to speed up unstructured. Does it support GPU? If so, what are the steps to make it use the GPU?

pastram-i commented 1 month ago

Installing torch-cpu before the unstructured libs should help. That way the NVIDIA GPU libraries for PyTorch are not installed. This is what I have been doing to build Lambda images.

For anyone who uses poetry, you can accomplish this in your pyproject.toml with these commands:

$ poetry source add --priority=explicit pytorch-cpu https://download.pytorch.org/whl/cpu
$ poetry add --source pytorch-cpu torch

The result in your pyproject.toml will look like this:

onnxruntime = "^1.18.1"
torch = {version = "^2.5.0+cpu", source = "pytorch-cpu"}
unstructured = {extras = ["csv", "doc", "docx", "pdf", "ppt", "pptx", "xlsx"], version = "^0.16.3"}

[[tool.poetry.source]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
priority = "explicit"

Sources: https://github.com/python-poetry/poetry/issues/7685 https://github.com/python-poetry/poetry/pull/8246/commits/948f3a9b95a200525223b897beaa92c8b255a444
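Whichever route is used (pip or poetry), one way to check that the resolved torch really is the CPU build is to inspect it from Python. This is a minimal sketch with a hypothetical function name. It relies on `torch.version.cuda` being `None` for CPU-only builds and on CPU wheels from the PyTorch index normally carrying a `+cpu` suffix in `torch.__version__`.

```python
import importlib.util


def torch_build_summary() -> dict:
    """Report whether torch is installed and whether it is a CUDA build."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch  # imported lazily so the check also works without torch

    return {
        "installed": True,
        "version": torch.__version__,            # typically ends in "+cpu" for CPU wheels
        "cuda_build": torch.version.cuda,        # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


print(torch_build_summary())
```

A CPU-only environment should report `cuda_build` as `None` and `cuda_available` as `False`.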

That said, I +1 having a CPU-only unstructured option to handle this.