Unstructured-IO / unstructured-api

Apache License 2.0

fix/Fix MS Office filetype errors and harden docker smoketest #436

Closed awalker4 closed 1 week ago

awalker4 commented 1 week ago

Changes

Fix for docx and other Office files returning {"detail":"File type None is not supported."}. After moving to the wolfi base image, the mimetypes lib no longer knows about these file extensions. To avoid issues like this, let's add an explicit mapping for every file extension we care about. I added a filetypes.py and moved get_validated_mimetype over. When this module is imported, it calls mimetypes.add_type for each file extension we support.
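
For illustration, here is a minimal sketch of the import-time registration described above. It is not the actual contents of filetypes.py: the EXTENSION_TO_MIMETYPE dict and the guess_mimetype helper are hypothetical names, and the mapping shows only a few Office types rather than the full supported list.

```python
import mimetypes
from typing import Optional

# Hypothetical subset of the supported extensions; the real mapping in the
# PR's filetypes.py covers more types than shown here.
EXTENSION_TO_MIMETYPE = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# Register every extension at import time so lookups no longer depend on
# whatever mime database the base image happens to ship.
for extension, mimetype in EXTENSION_TO_MIMETYPE.items():
    mimetypes.add_type(mimetype, extension)


def guess_mimetype(filename: str) -> Optional[str]:
    """Return the mimetype guessed from the filename, or None if unknown."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed
```

With the mapping registered, guess_mimetype("fake.docx") resolves to the Word mimetype even on an image with no system mime.types file, which is the situation that produced "File type None".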

Update smoke test coverage. This bug snuck past because the docker smoke test was already providing the mimetype. I updated test_happy_path to test against the container both with and without passing content_type. I also added some missing filetypes and sorted the test params by extension, so it is easy to see when a new type is missing.
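
As a rough sketch of that test matrix (not the real test_happy_path), the cases could be generated as below. SAMPLE_FILES and build_cases are hypothetical names, and apart from fake.docx the file names are placeholders rather than the repo's actual sample docs.

```python
from typing import List, Optional, Tuple

# Hypothetical sample files mapped to the content type a client might send.
# Only fake.docx comes from the PR text; the other names are placeholders.
SAMPLE_FILES = {
    "fake.docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "fake.doc": "application/msword",
    "fake.pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def build_cases() -> List[Tuple[str, Optional[str]]]:
    """Build (filename, content_type) pairs, sorted by file extension.

    Each file appears twice: once with an explicit content type and once
    with None, which forces the API to detect the type on its own.
    """
    names = sorted(SAMPLE_FILES, key=lambda name: name.rsplit(".", 1)[-1])
    cases: List[Tuple[str, Optional[str]]] = []
    for name in names:
        cases.append((name, SAMPLE_FILES[name]))  # client supplies the type
        cases.append((name, None))  # server must infer the type
    return cases
```

A list like this could be passed to pytest.mark.parametrize so that each file is posted to the container twice. The None case is the one that would have caught this bug.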

Testing

The new smoke test will verify that all filetypes are working. You can also run make docker-build && make docker-start-api, then test with the docx in the sample docs dir. On main, this file will give you the error above.

curl 'http://localhost:8000/general/v0/general' \
--form 'files=@"fake.docx"'