Unstructured-IO / unstructured-api

Apache License 2.0
446 stars 101 forks source link

fix(chunking): minimal fix for combine arg typo #345

Closed scanny closed 6 months ago

scanny commented 6 months ago

This is the minimal fix for the combine_text_under_n_chars arg-name typo bug described in #337.

Chunking parameters were added to the API about 4 months ago in this commit: https://github.com/Unstructured-IO/unstructured-api/commit/49fe71026e026530276bbb72f6a9df46feb6d1c2

At that time, the identifier combine_under_n_chars was used instead of combine_text_under_n_chars (as that parameter appears in chunk_by_title()). As a result, any value supplied for this argument via the REST API is ignored and defaults to 500. The combine_under_n_chars parameter name appears in the API documentation and elsewhere (README.md etc.) and differs from the parameter name documented for chunk_by_title().

A fuller remedy will be to add a separate combine_text_under_n_chars parameter to the API endpoint such that either name is accepted for now, then update the documentation such that all mentions are combine_text_under_n_chars. At some point in the future we may remove the combine_under_n_chars parameter from the endpoint.

However, making those changes requires a synchronized release of the unstructured library and in the interest of time we're making this "quick-fix" that can be released independently of any other dependencies.

scanny commented 6 months ago

... that test is asserting that the api parameters are properly sent to partition, so I think you'll just need to rename the arg over there as well.

The parameter name in partition() is correct, that's where the disconnect was. combine_under_n_chars just got lumped into **kwargs and hung around unused; combine_text_under_n_chars then took its default (500).

Actually, partition() doesn't define that parameter explicitly, it doesn't appear until partition_pdf() and even there it's not strictly required because partition_pdf() has **kwargs as well. It's the @add_chunking_strategy() decorator that actually goes looking for that specific name.

It's one of the challenges with taking advantage of the **kwargs behaviors, you lose type checking and then bugs like this one can find a home. I think the way I would solve this if we had the engineering will for it would be to create a PartitionOptions object that can be created at the earliest opportunity (by first receiver of call) and passed around whole from there on. I'd have to noodle it some more but for now I think we just have to be very diligent about arg names.

scanny commented 6 months ago

@awalker4 how does the versioning work for this repo? Should I be bumping the version to 0.0.62-dev0 in prepline_general/api/app.py and add my CHANGELOG entry under that new heading?

If you can help me understand how this ends up appearing in production after merging, that would help me a lot so I can let the end-users know what to expect :)

awalker4 commented 6 months ago

@scanny The version change is driven by the changelog - so you can add a section there under 0.0.63-dev0. Running make version-sync will then replace the 0.0.62 string in the code. The endpoint will then live at /general/v0/general and /general/v0.0.63/general. Will dm you with deployment details.