Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Install `python-magic-bin` instead of `python-magic` for Windows #234

Closed. MthwRobinson closed this issue 10 months ago.

MthwRobinson commented 1 year ago

Currently, Windows users have difficulty with file type detection because `python-magic-bin`, rather than `python-magic`, must be installed on Windows. The goal of this issue is to determine whether we can install `python-magic-bin` instead of `python-magic` when the user's OS is Windows.
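One possible way to express the OS-dependent dependency is with PEP 508 environment markers in the requirement strings. The sketch below is only an illustration, not the project's actual packaging setup: `INSTALL_REQUIRES` and `magic_requirement` are hypothetical names, and the helper just mirrors in plain Python what the markers ask the installer to do.

```python
import platform

# Hypothetical requirement strings using PEP 508 environment markers.
# An installer such as pip evaluates the marker after the semicolon and
# only installs the requirement when it is true for the current system.
INSTALL_REQUIRES = [
    'python-magic; platform_system != "Windows"',
    'python-magic-bin; platform_system == "Windows"',
]


def magic_requirement(system=None):
    """Return the package name that would be selected for a given OS.

    `system` is a value in the style of platform.system(), e.g. "Windows",
    "Linux" or "Darwin". When omitted, the current OS is used.
    """
    if system is None:
        system = platform.system()
    return "python-magic-bin" if system == "Windows" else "python-magic"


if __name__ == "__main__":
    print(magic_requirement())
```

With markers like these, the choice is made at install time, so no runtime OS check is needed in the library itself.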

See this comment for details.

tomaarsen commented 1 year ago

@MthwRobinson I can confirm that `python-magic-bin` must be installed on Windows. However, note that the tests do not pass with it. Notably:

FAILED test_unstructured/partition/test_auto.py::test_auto_partition_email_from_file - ValueError: Invalid file. File type not support in partition.
FAILED test_unstructured/partition/test_auto.py::test_auto_partition_html_from_file - ValueError: Invalid file. File type not support in partition.
FAILED test_unstructured/partition/test_auto.py::test_auto_partition_text_from_file - ValueError: Invalid file. File type not support in partition.
FAILED test_unstructured/staging/test_base_staging.py::test_convert_to_isd_serializes_with_posix_paths - NotImplementedError: cannot instantiate 'PosixPath' on your system

(Note: this is not an exhaustive list of test failures.)

For the first three failures, fake-text.txt, fake-html.html, and fake-email.eml are all detected by libmagic as the application/octet-stream MIME type, after which unstructured checks whether the file might be a docx, xlsx, or pptx. When that check fails, it assigns the unknown file type.
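To illustrate the failure mode described above, here is a minimal sketch of one possible workaround: falling back to the file extension when content-based detection only yields application/octet-stream. This is an assumption made for illustration, not how unstructured handles it. `resolve_mime` and `EXTENSION_TO_MIME` are hypothetical names, and the detected MIME type is passed in as a plain string so that the example does not depend on libmagic.

```python
import os

OCTET_STREAM = "application/octet-stream"

# Hypothetical extension table covering only the three failing test files.
EXTENSION_TO_MIME = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".eml": "message/rfc822",
}


def resolve_mime(filename, detected_mime):
    """Prefer the content-based result; fall back to the file extension.

    `detected_mime` stands in for whatever libmagic reported. If it is
    anything more specific than application/octet-stream, it is returned
    unchanged. Otherwise the extension is looked up in the table above,
    and application/octet-stream is kept when the extension is unknown.
    """
    if detected_mime != OCTET_STREAM:
        return detected_mime
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_MIME.get(extension, OCTET_STREAM)


if __name__ == "__main__":
    for name in ("fake-text.txt", "fake-html.html", "fake-email.eml"):
        print(name, "->", resolve_mime(name, OCTET_STREAM))
```

An extension-based fallback only helps when a filename is available, so it would not cover file-like objects without a name.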

Lastly, a `PosixPath` simply cannot be instantiated on Windows.
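For context, `pathlib.PosixPath` is a concrete path class and can only be instantiated on POSIX systems, whereas the pure classes (`PurePosixPath`, `PureWindowsPath`) do no I/O and work on any OS. The sketch below shows one portable way to produce POSIX-style path strings. It is an assumption about a possible approach, not the actual fix. `to_posix_string` is a hypothetical helper and `example-docs` is only a placeholder directory name.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_posix_string(path):
    """Return `path` with forward slashes, regardless of the host OS.

    PureWindowsPath accepts both backslashes and forward slashes as
    separators, and as_posix() renders the result with forward slashes.
    """
    return PureWindowsPath(path).as_posix()


if __name__ == "__main__":
    # Pure paths can be built on any platform, including Windows.
    joined = PurePosixPath("example-docs") / "fake-text.txt"
    print(str(joined))
    print(to_posix_string("example-docs\\fake-text.txt"))
```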

I'll open a PR for the last issue.