Unstructured-IO / unstructured-api

Apache License 2.0
489 stars 102 forks source link

chore/add hi_res_model_name to partition and deprecate model_name #315

Closed awalker4 closed 7 months ago

awalker4 commented 9 months ago

This api param doesn't match the ones going to the library so we should just fix it.

cragwolfe commented 9 months ago

or add hi_res_model_name as an alias kwarg in partition in the unstructured lib. i think more specific is better. there could be different model names for different "mini pipelines"

Coniferish commented 8 months ago

Returning to this issue... I am unclear on how this issue should be resolved.

Some observations: Neither model_name or hi_res_model_name are used in auto.py partition (they're passed through via **kwargs).

model_name does appear in pdf.py in _partition_pdf_or_image_local(), but it doesn't appear in partition_pdf() or anywhere else in pdf.py.

Both hi_res_model_name and model_name are used by a couple of dataclasses in unstructured/ingest/interfaces.py, but just model_name seems to be used in other files in the unstructured repo.

In unstructured-api, we simply define "model_name": hi_res_model_name,

But we also have odd adjustments like the following to account for the name mismatch:

if file_content_type == "application/pdf" and pdf_parallel_mode_enabled:
    # Be careful of naming differences in api params vs partition params!
    # These kwargs are going back into the api, not into partition
    # They need to be switched back in partition_pdf_splits
    if partition_kwargs.get("model_name"):
        partition_kwargs["hi_res_model_name"] = partition_kwargs.pop("model_name")

    elements = partition_pdf_splits(
        request=request,
        pdf_pages=pdf.pages,
        coordinates=show_coordinates,
        **partition_kwargs,
    )

So, I just want to make sure the desired fix for this to make hi_res_model_name consistent in unstructured-api (and get rid of the need for the redefining like the snippet above) and to add an alias/redefine model_name in the unstructured repo so it doesn't matter if the model argument is defined with model_name or hi_res_model_name. @awalker4 @cragwolfe

cragwolfe commented 8 months ago

to answer it from the requirements perspective: partition() or any partition_ function that current takes a (hires)model_name should accept both the deprecated model_name and preferred hi_res_model_name. That one is deprecated should be clear in the signature of the functions.

Coniferish commented 8 months ago

So, hi_res_model_name is the preferred and model_name should be deprecated (opposite of what this issue originally proposed given the title, yes?)

cragwolfe commented 8 months ago

correct.