Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Sensitive data security issues #3178

Closed: arkim822 closed this issue 3 weeks ago

arkim822 commented 3 weeks ago

Hello,

I am trying out the local installation of unstructured

pip install unstructured

and I see it has some interaction with Hugging Face. Is the data being parsed (PDFs) sent there to run on Hugging Face's servers, or is everything run locally?

Thank you.

MthwRobinson commented 3 weeks ago

Hi @arkim822 - all processing with the unstructured library is performed locally. The only interaction with Hugging Face is to download document understanding models for processing PDFs and images. Once the model files are downloaded, no network connection is required. The Dockerfile lines linked below show the recommended pattern for downloading the required models.

https://github.com/Unstructured-IO/unstructured/blob/c822e3fd10026639b8183846008263ff5b6b02a9/Dockerfile-amd64#L36-L39
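
For anyone who wants to enforce the "no network after download" behavior described above, here is a minimal sketch. It is not taken from the linked Dockerfile. It assumes the models are already in the local Hugging Face cache and that the PDF extras are installed (`pip install "unstructured[pdf]"`). `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are environment variables documented by huggingface_hub and transformers; the helper names `enable_offline_mode` and `parse_pdf_offline` are hypothetical and only for illustration.

```python
import os


def enable_offline_mode(env=None):
    """Set the Hugging Face offline flags in the given mapping (default: os.environ)."""
    target = os.environ if env is None else env
    # HF_HUB_OFFLINE is read by huggingface_hub, TRANSFORMERS_OFFLINE by transformers.
    target["HF_HUB_OFFLINE"] = "1"
    target["TRANSFORMERS_OFFLINE"] = "1"
    return target


def parse_pdf_offline(path):
    """Hypothetical helper: partition a PDF with the offline flags set first."""
    enable_offline_mode()
    # Imported lazily so the flags are in place before any Hugging Face code loads.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path)
```

With the flags set, a call such as `parse_pdf_offline("example.pdf")` (a placeholder path) should raise an error if a model is missing from the cache instead of trying to download it, which makes it easy to confirm that nothing leaves the machine.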

arkim822 commented 3 weeks ago

Hi @MthwRobinson, I appreciate the quick response. That's great to hear :)

I'll let my peers know.