Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.44k stars, 580 forks

feat(docx): differentiate no-file from not-ZIP #3306

Closed: scanny closed this 6 days ago

scanny commented 6 days ago

Summary

The python-docx error docx.opc.exceptions.PackageNotFoundError arises both when no file exists at the given path and when the file exists but is not a ZIP archive (and so is not a DOCX file).

This ambiguity is unwelcome when diagnosing the error, because the two conditions generally call for different courses of action to resolve it.

Add detailed validation to DocxPartitionerOptions to distinguish these two cases and provide more precise exception messages.
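As a rough illustration of the idea (not the code in this PR), the two conditions can be told apart with the standard library before the file is ever handed to python-docx. The helper name validate_docx_path, the choice of exception types, and the message wording are all assumptions made for this sketch; the actual validation lives in DocxPartitionerOptions and may differ.

```python
import os
import zipfile


def validate_docx_path(file_path: str) -> None:
    """Hypothetical helper: raise a distinct error for each failure mode.

    This is a sketch under stated assumptions, not the validation that
    DocxPartitionerOptions performs.
    """
    # Case 1: nothing exists at the path (or it is not a regular file).
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"no such file or directory: {file_path!r}")

    # Case 2: the file exists but is not a ZIP archive, so it cannot be a DOCX.
    if not zipfile.is_zipfile(file_path):
        raise ValueError(f"not a ZIP archive (so not a DOCX file): {file_path!r}")
```

A caller would run this check first and only then open the document, so a missing path and a non-ZIP file each surface with their own message instead of the shared PackageNotFoundError.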


sentry-io[bot] commented 1 day ago

Suspect Issues

This pull request was deployed and Sentry observed the following issues:
