Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

fix(docx): refine file-not-found vs not-DOCX #3317

Closed: scanny closed this 2 days ago

scanny commented 5 days ago

Summary: Make a small refinement to how the file-not-found error is distinguished from the not-a-DOCX error.

Additional Context: Conversion of a SpooledTemporaryFile to io.BytesIO must happen (when necessary) before testing whether a file-like object is a zipfile.

Move that check and conversion into ._validate() so it's done before checking that a file-like object is a zip archive.
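The ordering described here (convert first, then run the zip check) might look roughly like the sketch below. The names `normalize_file` and `validate` and the `ValueError` message are hypothetical stand-ins for illustration; this is not the library's actual `._validate()` code.

```python
import io
import tempfile
import zipfile
from typing import IO


def normalize_file(file: IO[bytes]) -> IO[bytes]:
    """Copy a SpooledTemporaryFile into io.BytesIO; pass other objects through.

    Hypothetical helper, assumed for this sketch only.
    """
    if isinstance(file, tempfile.SpooledTemporaryFile):
        file.seek(0)
        return io.BytesIO(file.read())
    return file


def validate(file: IO[bytes]) -> IO[bytes]:
    """Convert first, then check that the file-like object is a zip archive."""
    file = normalize_file(file)
    if not zipfile.is_zipfile(file):
        # Hypothetical error; the real message and exception type may differ.
        raise ValueError("not a zip archive, so not a DOCX file")
    file.seek(0)  # is_zipfile() moves the position; rewind for later readers
    return file


# Build a tiny zip archive to stand in for a DOCX file.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")

spooled = tempfile.SpooledTemporaryFile()
spooled.write(archive.getvalue())

validated = validate(spooled)
names = zipfile.ZipFile(validated).namelist()

# A non-zip payload should be rejected by the same check.
try:
    validate(io.BytesIO(b"plain text, not a zip archive"))
    rejected = False
except ValueError:
    rejected = True
```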

scanny commented 2 days ago

Turned out to be unnecessary. zipfile.is_zipfile() works fine on a SpooledTemporaryFile (even though zipfile won't open one). So no need to convert SpooledTemporaryFile in ._validate().
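A quick way to check this observation is the minimal sketch below, which uses only the standard library. The helper names `make_zip_bytes` and `spooled_is_zip` are made up for illustration. It only exercises `zipfile.is_zipfile()`; whether `zipfile.ZipFile` can open a SpooledTemporaryFile directly may depend on the Python version, so that part is not tested here.

```python
import io
import tempfile
import zipfile


def make_zip_bytes() -> bytes:
    """Build a tiny in-memory zip archive to stand in for a DOCX file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
    return buf.getvalue()


def spooled_is_zip(payload: bytes) -> bool:
    """Write `payload` to a SpooledTemporaryFile and test it with is_zipfile()."""
    with tempfile.SpooledTemporaryFile() as spooled:
        spooled.write(payload)
        spooled.seek(0)
        # No conversion to io.BytesIO: pass the spooled file directly.
        return zipfile.is_zipfile(spooled)


zip_result = spooled_is_zip(make_zip_bytes())
text_result = spooled_is_zip(b"plain text, not a zip archive")
print(zip_result, text_result)  # prints "True False"
```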