Unstructured-IO / unstructured-api

Apache License 2.0
489 stars 102 forks source link

chore: Roll our openapi.json back into FastAPI #317

Closed awalker4 closed 7 months ago

awalker4 commented 9 months ago

The openapi.json in this repo is the source of truth for our Speakeasy client. We want to roll these changes back into FastAPI so they get into our hosted Swagger docs. The Speakeasy team has given us a diff to get started, so we just need to get this merged.

Note that many of our parameters are lists when they shouldn't be. Much of the complexity here is working around that in the generated spec, so it might be cleaner to just fix the params first. Goal for this ticket should just be to apply the linked diff. We'll address the param types elsewhere.

cragwolfe commented 9 months ago

Note that many of our parameters are lists when they shouldn't be.

however, i believe at this point we need to maintain backwards compat of for pretty much all of the existing params.

g-parki commented 9 months ago

Hi @awalker4, I came across the "everything in the docs is a list" issue and I poked around to see if Pydantic was used or could be implemented/updated. Is there some history there? I see Pydantic listed in base.in, but I don't see it actually being imported (see grep below).

I'd be happy to contribute a PR for request models if that would be welcome.

grep -r "pydantic" .
./requirements/base.in:pydantic<2.0.2
./requirements/test.txt:pydantic==1.10.13
./requirements/test.txt:    #   pydantic
./requirements/base.txt:pydantic==1.10.13
./requirements/base.txt:    #   pydantic
./requirements/base.txt:    #   pydantic-core
./prepline_general/api/general.py:    # Note(austin): pydantic should control this sort of thing for us
g-parki commented 9 months ago

I'd be happy to contribute a PR for request models if that would be welcome.

Actually I see the request is form data, which I don't think can be neatly wrapped in a Pydantic model. For backward compatibility with the array'd params, a Pydantic annotated validator could be used to take only the first element of the list as needed. The Swagger docs and the OpenAPI schema would only show the new types, so the backward compatibility would be undocumented.

T = TypeVar("T")

def first_element_of_list(value: Union[T, list[T]]) -> T:
    if isinstance(value, list) and len(value) > 0:
        return value[0]
    return value

@router.post("/general/v0/general")
def pipeline_1(
    ...
    strategy: Annotated[
        Literal["fast", "hi_res", "auto"],
        Form(
            title="Strategy",
            description="The strategy to use for partitioning PDF/image. Options are fast, hi_res, auto. Default: auto",
            examples=["auto", "hi_res"],
        ),
        BeforeValidator(first_element_of_list),
    ] = "auto",
    ...
)
    ...
image
awalker4 commented 8 months ago

A PR for this would be great! For context, we used to autogenerate this repo through unstructured-api-tools. That project generates FastAPI code from a Jupyter notebook, for quickly spinning up data pipelines. unstructured-api has outgrown api-tools, and many of those conventions aren't useful here.

awalker4 commented 7 months ago

Closed with #359