Closed TaylorN15 closed 1 month ago
@scanny - Any thoughts on the python-docx stack traces? We don't do anything special with the NLTK_DATA environment variable, that all gets handled by nltk.
I'd be looking for the original partitioning call, especially for the file-type (ODT, DOC, DOCX), to get any real insight.
This error is the one you get when a file-path is provided to python-docx and either:
At the outermost level, a DOCX file is a Zip archive. So if the file isn't a Zip archive it's definitely not a DOCX file.
On the python-docx issues list this most frequently occurs when someone tries to use python-docx for a DOC file (pre-2007 Word file), but there are any number of ways it can happen.
The rest of the stack trace might also narrow it down.
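As a rough illustration of the Zip-archive point above, here is a minimal sketch of a pre-check that could be run on a path before it is handed to python-docx. It uses only the standard library; the function name `classify_word_file` and its return labels are hypothetical, and the check only tells you whether the container is plausible, not that the file is a valid DOCX.

```python
import zipfile
from pathlib import Path

# Signature at the start of OLE2 compound files, the container used by
# legacy (pre-2007) .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def classify_word_file(path):
    """Return 'missing', 'zip', 'legacy-doc', or 'other' for the given path.

    Hypothetical helper for triage only; it does not call python-docx.
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    if zipfile.is_zipfile(str(p)):
        # A Zip container is a DOCX candidate, nothing more.
        return "zip"
    with p.open("rb") as fh:
        head = fh.read(len(OLE2_MAGIC))
    # Not a Zip: check whether it looks like an old binary Word file.
    return "legacy-doc" if head == OLE2_MAGIC else "other"
```

Logging the result of a check like this next to the original partitioning call would show whether the path was missing, a renamed DOC, or something else entirely.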
Thanks for the quick responses. I think I may have been incorrect about the NLTK data, as once we added the rule to our firewall to allow access to GitHub for downloads, I got the error again. I then realised the error was caused by trying to download the telemetry package from your servers.
Initially I suspected an issue with NLTK, as it worked when I copied the NLTK data to /home/site (on the app service), but I think my networking guys were also troubleshooting at the same time, so it was a false positive.
Thanks for following up, @TaylorN15!
Describe the bug I'm not sure if this is an issue with unstructured or nltk... I am running on Azure Functions in an App Service Environment within an internal network, where all outbound traffic is blocked and allowed by exception only. I have downloaded the required NLTK packages, stored them with the functions code, and set an environment variable for NLTK_DATA in the app config. But it still tries to download the NLTK packages and times out (fails). If I (manually) copy the nltk_data folder to ~/nltk_data/ it works, but this is not viable as this directory is volatile.
To Reproduce Block access to nltk.org, run any partition function that requires NLTK packages.
Expected behavior The code should check environment variable NLTK_DATA
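For anyone debugging a similar setup, a small diagnostic along these lines may help confirm what NLTK_DATA actually points at inside the running app. It is only a sketch: it does not import nltk, the function names are hypothetical, and it assumes that NLTK_DATA is split on the platform path separator and that a resource may sit either unpacked or as a .zip next to where the folder would be. The resource path "tokenizers/punkt" in the usage note is just an example; the thread does not say which packages were needed.

```python
import os
from pathlib import Path


def nltk_data_dirs(environ=None):
    """Return (directory, exists) pairs for each entry in NLTK_DATA."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NLTK_DATA", "")
    # Skip empty entries produced by a leading/trailing separator.
    return [(d, Path(d).is_dir()) for d in raw.split(os.pathsep) if d]


def find_resource(relative, environ=None):
    """Return the first NLTK_DATA entry holding `relative`, else None.

    A resource counts as present if it exists unpacked or as `<relative>.zip`.
    """
    for directory, exists in nltk_data_dirs(environ):
        if not exists:
            continue
        target = Path(directory) / relative
        if target.exists() or Path(str(target) + ".zip").is_file():
            return directory
    return None
```

Calling `find_resource("tokenizers/punkt")` from within the function app and logging the result would show whether the variable is visible to the process and whether the bundled data is where it is expected to be.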
After timing out, I assume it's trying to unzip the downloaded NLTK data, which doesn't exist. The stack trace indicates an issue with python-docx, but I couldn't find NLTK referenced in there.