Closed awalker4 closed 1 year ago
This generally looks good!
However, the default behavior should probably be to allow any file type. Users could add additional constraints either inline in def pipeline_api
or by setting an env var, such as you have here.
Assuming that is the case, a pipeline notebook such as the unstructured-api one could have a cell specific to that API:
# pipeline-api
import os

DEFAULT_MIMETYPES = "application/epub+zip, ..."
# Apply the API-specific defaults only when the variable is unset or empty.
if not os.environ.get("UNSTRUCTURED_ALLOWED_MIMETYPES"):
    os.environ["UNSTRUCTURED_ALLOWED_MIMETYPES"] = DEFAULT_MIMETYPES
Awesome, that makes sense. Also means that we can test against the set of partition.auto
file types in unstructured-api without having to duplicate them over here and muddy the waters.
The FastAPI code now has an environment variable called UNSTRUCTURED_ALLOWED_MIMETYPES, which can be configured at runtime. (The default is to allow all types.) Wherever we take a file input, we check against this variable and return a 400 response if the type isn't allowed. Sometimes the client won't send a helpful mimetype, so if we get a generic application/octet-stream we can fall back to the Python mimetypes lib, which keys off the file extension. This also gives us better input for file_content_type.
If we're happy with this approach longer term, it's probably worth documenting the env var somewhere, along with how to update it.
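For reference, the check described above could look roughly like the sketch below. It uses only the standard library and is not the code from this PR: the helper names resolve_content_type and is_allowed are hypothetical, and it assumes the env var holds a comma-separated list of mimetypes. In the API, a False result from is_allowed is where the 400 response would be returned.

```python
import mimetypes
import os
from typing import Optional

ENV_VAR = "UNSTRUCTURED_ALLOWED_MIMETYPES"
GENERIC_TYPE = "application/octet-stream"


def resolve_content_type(filename: str, declared: Optional[str]) -> Optional[str]:
    """Hypothetical helper: prefer the declared type unless it is missing or generic."""
    if declared and declared != GENERIC_TYPE:
        return declared
    # mimetypes.guess_type keys off the file extension and returns (type, encoding).
    guessed, _encoding = mimetypes.guess_type(filename)
    # Keep the declared value if the extension is unknown.
    return guessed or declared


def is_allowed(content_type: Optional[str]) -> bool:
    """Hypothetical helper: an unset or empty variable allows every type."""
    raw = os.environ.get(ENV_VAR, "")
    # Assumption: the variable is a comma-separated list, possibly with spaces.
    allowed = {item.strip() for item in raw.split(",") if item.strip()}
    return not allowed or content_type in allowed
```

Reading the variable on every call, rather than once at import time, keeps it configurable at runtime and makes it easy to override in tests.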
Will let #144 get merged first, because I can reuse some of the new test fixtures.