Closed awalker4 closed 7 months ago
or add hi_res_model_name
as an alias kwarg in partition
in the unstructured lib. i think more specific is better. there could be different model names for different "mini pipelines"
Returning to this issue... I am unclear on how this issue should be resolved.
Some observations:
Neither model_name
or hi_res_model_name
are used in auto.py partition
(they're passed through via **kwargs
).
model_name
does appear in pdf.py
in _partition_pdf_or_image_local()
, but it doesn't appear in partition_pdf()
or anywhere else in pdf.py
.
Both hi_res_model_name
and model_name
are used by a couple of dataclasses in unstructured/ingest/interfaces.py, but just model_name
seems to be used in other files in the unstructured
repo.
In unstructured-api
, we simply define "model_name": hi_res_model_name,
But we also have odd adjustments like the following to account for the name mismatch:
if file_content_type == "application/pdf" and pdf_parallel_mode_enabled:
# Be careful of naming differences in api params vs partition params!
# These kwargs are going back into the api, not into partition
# They need to be switched back in partition_pdf_splits
if partition_kwargs.get("model_name"):
partition_kwargs["hi_res_model_name"] = partition_kwargs.pop("model_name")
elements = partition_pdf_splits(
request=request,
pdf_pages=pdf.pages,
coordinates=show_coordinates,
**partition_kwargs,
)
So, I just want to make sure the desired fix for this to make hi_res_model_name
consistent in unstructured-api
(and get rid of the need for the redefining like the snippet above) and to add an alias/redefine model_name
in the unstructured
repo so it doesn't matter if the model argument is defined with model_name
or hi_res_model_name
.
@awalker4 @cragwolfe
to answer it from the requirements perspective: partition()
or any partition_
function that current takes a (hires)model_name should accept both the deprecated model_name
and preferred hi_res_model_name
. That one is deprecated should be clear in the signature of the functions.
So, hi_res_model_name
is the preferred and model_name
should be deprecated (opposite of what this issue originally proposed given the title, yes?)
correct.
This api param doesn't match the ones going to the library so we should just fix it.