Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.61k stars, 704 forks

bug/Unable to download NLTK data #3617

Status: Open. TaylorN15 opened 2 weeks ago

TaylorN15 commented 2 weeks ago

Describe the bug

Since the change that stopped using nltk.download(), my application cannot download the required NLTK packages. The application is behind a firewall, and we are only allowed to make exceptions for specific traffic. An exception for a public S3 bucket is proving difficult to get approved.

I get an error when it attempts to download the packages:

<urlopen error [Errno 104] Connection reset by peer>

To Reproduce

Use a partitioner that requires NLTK.

Expected behavior

The NLTK package download doesn't fail.

Additional context

Perhaps there is a way to include the required NLTK packages or pre-download them before the application is zipped and deployed?
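To make the pre-download idea concrete, here is a minimal sketch, not a confirmed fix. It assumes the package names listed in the Dockerfile snippet later in this thread, and the helper names `predownload` and `use_bundled_data` are hypothetical. Whether unstructured then skips its own S3 download once the data is present locally is not confirmed anywhere in this thread.

```python
import os
from pathlib import Path

# Assumption: package names taken from the Dockerfile snippet in this thread;
# the set a given partitioner actually needs may differ.
REQUIRED_PACKAGES = ("stopwords", "punkt", "averaged_perceptron_tagger")


def predownload(target_dir: Path, packages=REQUIRED_PACKAGES) -> list:
    """Build-time step: run on a machine that can reach the NLTK download index.

    Returns the names of packages that could not be fetched.
    """
    import nltk  # imported lazily so the runtime helper below does not need it

    target_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for name in packages:
        # nltk.download returns False when a package could not be fetched.
        if not nltk.download(name, download_dir=str(target_dir), quiet=True):
            failed.append(name)
    return failed


def use_bundled_data(target_dir: Path) -> None:
    """Runtime step: call at startup, before nltk is imported anywhere.

    NLTK reads the NLTK_DATA environment variable when it builds its
    data search path, so the variable has to be set before that import.
    """
    os.environ["NLTK_DATA"] = str(target_dir.resolve())
```

In this sketch, `predownload(Path("nltk_data"))` would run in the build step and the resulting directory would be zipped with the application, while `use_bundled_data(Path("nltk_data"))` would run first thing at startup.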

Falven commented 1 week ago

This sounds like an IT problem unrelated to the framework. If you're behind a firewall, how do you expect to download any NLTK packages?

I would recommend that you Dockerize the application and cache the dependencies, building your container somewhere with internet access.

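# Directory where NLTK looks for its data (via the NLTK_DATA environment variable)
ENV NLTK_DATA=/usr/share/nltk_data
# Create the directory and make it accessible to any runtime user
RUN mkdir -p $NLTK_DATA && chmod -R 777 $NLTK_DATA
# Download the packages at build time, while internet access is available
RUN python -m nltk.downloader -d $NLTK_DATA stopwords punkt averaged_perceptron_tagger

TaylorN15 commented 1 week ago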

We already have an exception for the NLTK packages as they are downloaded from GitHub, and this exception was already in place to allow certain Python packages and Oryx builds to work.

I'm just saying that someone else may encounter this same issue, as most IT departments won't allow access to an unknown public S3 bucket.