Unstructured-IO / unstructured-api-tools

Apache License 2.0

Add validation for supported file types in api template #145

Closed awalker4 closed 1 year ago

awalker4 commented 1 year ago

The FastAPI code now reads an environment variable, UNSTRUCTURED_ALLOWED_MIMETYPES, which can be configured at runtime. (The default is to allow all types.) Wherever we take a file input, we'll check its mimetype against this variable and return a 400 response if the type isn't allowed. Sometimes the client won't send a helpful mimetype, so if we get a generic application/octet-stream we can fall back to the Python mimetypes library, which keys off the file extension. This also gives us better input for file_content_type.
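
For anyone skimming, here is a rough sketch of the check described above. It is not the generated template code. It assumes the env var holds a comma-separated list and that an unset or empty value means "allow everything". The helper names (allowed_mimetypes, resolve_content_type, is_allowed) are made up for illustration.

```python
import mimetypes
import os
from typing import Optional, Set

ENV_VAR = "UNSTRUCTURED_ALLOWED_MIMETYPES"
GENERIC_TYPE = "application/octet-stream"


def allowed_mimetypes() -> Optional[Set[str]]:
    """Return the configured allow-list, or None when every type is allowed.

    Assumption: the variable is a comma-separated list of mimetypes.
    """
    raw = os.environ.get(ENV_VAR, "")
    parsed = {item.strip() for item in raw.split(",") if item.strip()}
    return parsed or None


def resolve_content_type(filename: str, content_type: Optional[str]) -> str:
    """Prefer the client's mimetype; guess from the extension when it is generic."""
    if content_type and content_type != GENERIC_TYPE:
        return content_type
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or GENERIC_TYPE


def is_allowed(filename: str, content_type: Optional[str]) -> bool:
    """Hypothetical helper: True when the resolved type passes the allow-list."""
    allowed = allowed_mimetypes()
    resolved = resolve_content_type(filename, content_type)
    return allowed is None or resolved in allowed
```

In a FastAPI route, a False result from is_allowed would be turned into the 400 response, for example by raising fastapi.HTTPException(status_code=400, detail=...).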

If we're happy with this approach longer term, it's probably worth documenting the env var somewhere along with how to update it.

Will let #144 get merged first, because I can reuse some of the new test fixtures.

cragwolfe commented 1 year ago

This generally looks good!

However, the default behavior should probably be to allow any file type. Users could add additional constraints either inline in def pipeline_api or by setting an env var, such as you have here.

Assuming that is the case, in a pipeline-notebook such as the unstructured-api one, one could have a cell:

# pipeline-api
import os

DEFAULT_MIMETYPES = "application/epub+zip, ..."

# Apply the API-specific default only when the variable is unset or empty.
if not os.environ.get("UNSTRUCTURED_ALLOWED_MIMETYPES"):
    os.environ["UNSTRUCTURED_ALLOWED_MIMETYPES"] = DEFAULT_MIMETYPES

specific to that API.

awalker4 commented 1 year ago

Awesome, that makes sense. It also means we can test against the set of partition.auto file types in unstructured-api without duplicating them over here and muddying the waters.