Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Not respecting NLTK_DATA environment variable #3125

Closed TaylorN15 closed 1 month ago

TaylorN15 commented 1 month ago

Describe the bug: I'm not sure if this is an issue with unstructured or nltk...

I am running on Azure Functions in an App Service Environment inside an internal network, where all outbound traffic is blocked and allowed by exception only. I have downloaded the required NLTK packages, stored them with the function's code, and set an NLTK_DATA environment variable in the app config. But it still tries to download the NLTK packages and times out (fails). If I manually copy the nltk_data folder to ~/nltk_data/ it works, but this is not viable because that directory is volatile.

To Reproduce: Block access to nltk.org, then run any partition function that requires NLTK packages.

Expected behavior: The code should check the NLTK_DATA environment variable.

After timing out, I assume it's trying to unzip the downloaded NLTK data, which doesn't exist. The stack trace indicates an issue with python-docx, but I couldn't find NLTK referenced in there.

  File "/home/site/wwwroot/.python_packages/lib/site-packages/unstructured/partition/docx.py", line 423, in _document
    return docx.Document(file)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/opc/package.py", line 127, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/opc/pkgreader.py", line 22, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/opc/phys_pkg.py", line 76, in __init__
    self._zipf = ZipFile(pkg_file, "r")
  File "/usr/local/lib/python3.10/zipfile.py", line 1271, in __init__
    self._RealGetContents()
  File "/usr/local/lib/python3.10/zipfile.py", line 1338, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

MthwRobinson commented 1 month ago

@scanny - Any thoughts on the python-docx stack traces? We don't do anything special with the NLTK_DATA environment variable, that all gets handled by nltk.
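For anyone checking their own setup, here is a minimal, stdlib-only sketch of the lookup order as I understand it: nltk splits NLTK_DATA on the OS path separator and searches those directories before ~/nltk_data. The helper names (`nltk_data_candidates`, `find_resource`) are hypothetical, the real search path (`nltk.data.path`) also includes system-wide directories, and nltk additionally accepts zipped resources, so treat this as a rough diagnostic rather than a copy of nltk's logic.

```python
import os
from pathlib import Path


def nltk_data_candidates(env=None):
    """Directories named in NLTK_DATA, followed by ~/nltk_data (simplified)."""
    env = os.environ if env is None else env
    # NLTK_DATA may hold several directories separated by os.pathsep
    dirs = [Path(p) for p in env.get("NLTK_DATA", "").split(os.pathsep) if p]
    dirs.append(Path.home() / "nltk_data")
    return dirs


def find_resource(relative_path, env=None):
    """Return the first candidate directory containing relative_path, else None."""
    for base in nltk_data_candidates(env):
        if (base / relative_path).exists():
            return base
    return None


if __name__ == "__main__":
    # Resource names here are examples; check which ones your partitioner needs.
    for name in ("tokenizers/punkt", "taggers/averaged_perceptron_tagger"):
        print(name, "->", find_resource(name))
```

If this prints `None` for a resource you bundled with the function code, the variable is probably not visible to the worker process, or it points one directory level too high or too low.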

scanny commented 1 month ago

I'd be looking for the original partitioning call, especially for the file-type (ODT, DOC, DOCX), to get any real insight.

This error is the one you get when a file-path is provided to python-docx and either:

At the outermost level, a DOCX file is a Zip archive. So if the file isn't a Zip archive it's definitely not a DOCX file.

On the python-docx issues list this most frequently occurs when someone tries to use python-docx for a DOC file (pre-2007 Word file), but there are any number of ways it can happen.

The rest of the stack trace might also narrow it down.
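The Zip-archive point above can be checked before the file ever reaches python-docx. The sketch below uses only the standard library; `sniff_word_file` is a hypothetical helper name, and the 8-byte signature is the OLE2 compound-file header used by legacy (pre-2007) Office files such as DOC. A result of "zip" means the file could be a DOCX, not that it definitely is one.

```python
import zipfile

# Header bytes of OLE2 compound files (legacy .doc, .xls, .ppt)
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_file(path):
    """Classify a file as 'zip' (possibly DOCX), 'ole2' (legacy DOC), or 'unknown'."""
    if zipfile.is_zipfile(path):
        return "zip"
    # Raises FileNotFoundError if the path does not exist
    with open(path, "rb") as fh:
        if fh.read(8) == OLE2_MAGIC:
            return "ole2"
    return "unknown"
```

An "unknown" result on a file that is supposed to be a DOCX usually means the bytes were damaged or replaced in transit, for example an HTML error page saved under a .docx name.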

TaylorN15 commented 1 month ago

Thanks for the quick responses. I think I may have been incorrect about the NLTK data, as once we added the rule to our firewall to allow access to GitHub for downloads, I got the error again. I then realised the error was caused by an attempt to download the telemetry package from your servers.

Initially I suspected an issue with NLTK because it worked when I copied the nltk_data folder to /home/site (on the app service), but I think my networking guys were also troubleshooting at the same time, so it was a false positive.

MthwRobinson commented 1 month ago

Thanks for following up @TaylorN15 !