Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
9.15k stars, 755 forks

bug/doesnt work offline #1080

Closed Fluder-Paradyne closed 1 year ago

Fluder-Paradyne commented 1 year ago

Describe the bug

When the internet connection is slow I get this error. I want my application to run offline.

To Reproduce

from unstructured.partition.auto import partition
elements = partition(file_path)
content = "\n\n".join([str(el) for el in elements])

Expected behavior

The file should be read successfully without a network connection.

Screenshots

[image]

Environment Info

Debian 11 (Docker image python:3.10-slim-bullseye)

Additional context

If you can point out what to download and where to place the files, I think I can do it in the Docker build step itself so that the download doesn't have to start every time.

thanks

cragwolfe commented 1 year ago

Yes, you can set the env var NLTK_DATA in your Dockerfile to the directory where the NLTK data should be downloaded, then do something like:

https://github.com/Unstructured-IO/unstructured/blob/331c7fa/Dockerfile#L45-L46
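
The same idea can also be scripted in Python and run during the Docker build. This is a minimal sketch, not the project's own build step: `fetch_nltk_data`, its `download` parameter, and the constant name are hypothetical and chosen for illustration. The two package names are the ones mentioned in this thread, and `nltk.download(name, download_dir=...)` is the standard NLTK downloader call.

```python
import os
from typing import Callable, Iterable, List, Optional

# Packages named in this thread; other versions of unstructured may need others.
DEFAULT_PACKAGES = ("punkt", "averaged_perceptron_tagger")


def fetch_nltk_data(
    target_dir: str,
    packages: Iterable[str] = DEFAULT_PACKAGES,
    download: Optional[Callable[..., bool]] = None,
) -> List[str]:
    """Download each package into target_dir and return the names that failed."""
    if download is None:
        # Imported lazily so this module can be loaded without nltk installed.
        import nltk

        download = nltk.download

    os.makedirs(target_dir, exist_ok=True)
    failed = []
    for name in packages:
        # nltk.download returns False when a package could not be fetched.
        if not download(name, download_dir=target_dir):
            failed.append(name)
    return failed
```

With NLTK_DATA set via ENV in the Dockerfile, a build-time script could call `fetch_nltk_data(os.environ["NLTK_DATA"])` and abort the build if the returned list is not empty, so the data is stored in the image rather than fetched at runtime.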

Fluder-Paradyne commented 1 year ago

Got it thanks

deku0818 commented 7 months ago

I found that every time I use it, it tries to download the data again. How can I make it use the copy that is already downloaded instead of downloading again?

Fluder-Paradyne commented 7 months ago

If you are using Docker, add these lines to your Dockerfile:

RUN python3 -c "import nltk; nltk.download('punkt')" && \
  python3 -c "import nltk; nltk.download('averaged_perceptron_tagger')"

Or just run the following in your terminal. This downloads the data to an NLTK folder on your local machine, which should be reused on later runs. If it is not reused, add an env var NLTK_DATA with the path of the downloaded folder as its value.

python3 -c "import nltk; nltk.download('punkt')" && \
  python3 -c "import nltk; nltk.download('averaged_perceptron_tagger')"

deku0818 commented 7 months ago

Thank you, adding the NLTK_DATA env var works.
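
For an application that must run offline, it may help to check at startup that the data is really present under NLTK_DATA, rather than letting a download attempt start. The following is a rough sketch using only the standard library. `missing_nltk_packages` and `REQUIRED` are hypothetical names, and the relative paths assume the default layout that the NLTK downloader produces when it unpacks these two packages. It only inspects the directories listed in NLTK_DATA, not NLTK's other default search locations or zipped packages.

```python
import os
from typing import List, Optional

# Assumed unpacked locations of the two packages named in this thread.
REQUIRED = {
    "punkt": os.path.join("tokenizers", "punkt"),
    "averaged_perceptron_tagger": os.path.join("taggers", "averaged_perceptron_tagger"),
}


def missing_nltk_packages(nltk_data: Optional[str] = None) -> List[str]:
    """Return package names not found under any directory listed in NLTK_DATA."""
    if nltk_data is None:
        nltk_data = os.environ.get("NLTK_DATA", "")
    # NLTK_DATA may hold several directories separated by os.pathsep.
    roots = [p for p in nltk_data.split(os.pathsep) if p]
    missing = []
    for name, rel_path in REQUIRED.items():
        if not any(os.path.isdir(os.path.join(root, rel_path)) for root in roots):
            missing.append(name)
    return missing
```

If the returned list is not empty, the application can stop with a clear message naming the missing packages instead of waiting on a slow or absent network.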