Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
9.21k stars, 764 forks

fix(filetype): handle missing libmagic library #3790

Open metadaddy opened 16 hours ago

metadaddy commented 16 hours ago

As reported in #3781, the check for availability of the libmagic library is incorrect. The existing code checks whether the magic module is installed, but importing magic fails if the libmagic library is not also present. On macOS, libmagic is not installed by default; the user must install it manually, typically via brew install libmagic.

This PR detects whether libmagic is installed by importing the magic module in a try block, setting LIBMAGIC_AVAILABLE accordingly. MAGIC_AVAILABLE is set to true if the magic module is installed so that an appropriate warning can be displayed if the fallback filetype module returns None for a mime type. Tests that rely on libmagic being installed are skipped if it is not.
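The detection described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the helper name detect_magic is hypothetical, and catching OSError alongside ImportError is an assumption made to be defensive about how a missing shared library might surface.

```python
from __future__ import annotations

import importlib
import importlib.util


def detect_magic(module_name: str = "magic") -> tuple[bool, bool]:
    """Return (module_installed, library_usable) for the given module name.

    Hypothetical helper for illustration; not part of unstructured.
    """
    # Is the Python package present at all? find_spec does not import it.
    module_installed = importlib.util.find_spec(module_name) is not None

    # Importing the magic module loads the libmagic shared library, so the
    # import itself is the probe: it fails when the library is missing.
    # Catching OSError as well is a defensive assumption.
    try:
        importlib.import_module(module_name)
        library_usable = True
    except (ImportError, OSError):
        library_usable = False

    return module_installed, library_usable


# Flag names follow the PR description.
MAGIC_AVAILABLE, LIBMAGIC_AVAILABLE = detect_magic()
```

With flags like these, a test that needs libmagic could plausibly be guarded with pytest.mark.skipif(not LIBMAGIC_AVAILABLE, reason="libmagic is not installed"), and the case where MAGIC_AVAILABLE is true but LIBMAGIC_AVAILABLE is false is the one where a warning about the missing library is most useful.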

pytest -v test_unstructured/file_utils succeeds:

(.venv) ppatterson@MBP-W7FQ7Y97F0 unstructured % pytest -v test_unstructured/file_utils
============================================== test session starts ==============================================
platform darwin -- Python 3.10.15, pytest-8.3.3, pluggy-1.5.0 -- /Users/ppatterson/src/unstructured/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/ppatterson/src/unstructured
configfile: setup.cfg
plugins: cov-5.0.0, mock-3.14.0, anyio-4.6.2.post1, requests-mock-1.12.1
collected 440 items
...
================================== 411 passed, 28 skipped, 1 xfailed in 3.68s ===================================