Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

CPU only installation #3326

Open arthurbrenno opened 3 days ago

arthurbrenno commented 3 days ago

I've been using unstructured for a while on a CPU-only machine. I've noticed a lot of NVIDIA files (over 2 GB) in my venv folder coming from PyTorch (possibly one of unstructured's dependencies).

Can I install a CPU-only version of unstructured? I've been partitioning documents for a while without ever using a GPU.

Here is my requirements.in file:

uvicorn[standard]==0.25.0
fastapi==0.111.0
pyyaml==6.0.1
injector==0.21.0
overrides==7.7.0
langchain==0.2.5
langchain-google-genai==1.0.6
json-repair==0.9.0
unstructured[pptx,image,docx,pdf]==0.14.9
opencv-python-headless==4.9.0.80
jq==1.6.0
pytesseract==0.3.10
pymilvus==2.3.6
langchain-openai==0.1.8
scikit-learn==1.5.0
ruff==0.3.1
pandas==2.2.1
llama-index==0.10.33
python-multipart==0.0.9
llama-index-vector-stores-milvus==0.1.10
playwright==1.43.0
python-magic==0.4.27
llama-index-llms-gemini==0.1.11
opencv-python==4.9.0.80
llama-index-llms-anthropic==0.1.11
llama-index-llms-ollama==0.1.5
llama-index-embeddings-ollama==0.1.2
pymupdf==1.24.4
pypdf[image]==4.2.0
llama-index-multi-modal-llms-ollama==0.1.3
llama-index-llms-groq==0.1.4
gensim==3.6.0
firebase-admin==6.5.0
demjson3==3.0.6
langchain-community==0.2.5
jsonschema==4.22.0
pypdf2==3.0.1
fpdf==1.7.2
moviepy==1.0.3
neo4j==5.21.0
llama-index-graph-stores-neo4j==0.2.5
pylatex==1.4.2
reportlab==4.2.0
psutil==5.9.8
fastapi-utils==0.7.0
colorama==0.4.6
humanize==4.9.0
objgraph==3.6.1
imgkit==1.2.3
pyppeteer==2.0.0
wkhtmltopdf==0.2
llama-agents==0.0.3
click==8.1.7
mypy==1.10.1

Note that torch is not listed in it, so it must be coming in as a transitive dependency.
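One way to confirm which installed packages pull in torch is to scan the metadata of every distribution in the environment. The following is a minimal sketch using only the standard library; the helper names `requirement_name` and `dependents_of` are made up for illustration, and the script only reports direct declared requirements (including optional extras), not the full dependency chain.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """Return the normalized project name at the start of a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    name = match.group(1) if match else ""
    # Normalize the way pip does: runs of "-", "_" and "." compare equal.
    return re.sub(r"[-_.]+", "-", name).lower()


def dependents_of(target: str) -> list[str]:
    """List installed distributions that declare `target` as a requirement."""
    target = re.sub(r"[-_.]+", "-", target).lower()
    found = set()
    for dist in metadata.distributions():
        dist_name = dist.metadata.get("Name")
        if not dist_name:
            continue  # skip distributions with broken metadata
        for req in dist.requires or []:
            if requirement_name(req) == target:
                found.add(dist_name)
    return sorted(found)


if __name__ == "__main__":
    # Run inside the venv in question; prints one package name per line.
    for name in dependents_of("torch"):
        print(name)
```

Run it with the interpreter of the affected venv. An empty result would mean that nothing installed declares torch directly.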

MthwRobinson commented 2 days ago

Thanks for the suggestion, @arthurbrenno. We'll take a look at this. I think it would have the side benefit of reducing the size of our CPU images.

arthurbrenno commented 20 hours ago

Tysm! It would save us about 3 GB of storage.
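For anyone who wants to check how much space is at stake in their own environment, here is a rough sketch that ranks the top-level directories in site-packages by size on disk. The function names are hypothetical, and it assumes the packages live under the interpreter's `purelib` path, which is the usual case for a venv.

```python
import os
import sysconfig
from pathlib import Path


def directory_size(path: Path) -> int:
    """Sum the sizes in bytes of all regular files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for filename in files:
            file_path = Path(root) / filename
            if not file_path.is_symlink():  # avoid counting link targets
                total += file_path.stat().st_size
    return total


def largest_packages(site_packages: Path, top: int = 10) -> list[tuple[str, int]]:
    """Return the `top` largest top-level directories as (name, bytes) pairs."""
    sizes = [
        (entry.name, directory_size(entry))
        for entry in site_packages.iterdir()
        if entry.is_dir()
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    site_packages = Path(sysconfig.get_paths()["purelib"])
    for name, size in largest_packages(site_packages):
        print(f"{size / 1e9:6.2f} GB  {name}")
```

In an environment like the one described in this thread, directories such as `nvidia` and `torch` would be expected near the top of the list.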