Closed scanny closed 6 months ago
... that test is asserting that the api parameters are properly sent to partition, so I think you'll just need to rename the arg over there as well.
The parameter name in partition()
is correct, that's where the disconnect was. combine_under_n_chars
just got lumped into **kwargs
and hung around unused; combine_text_under_n_chars
then took its default (500).
Actually, partition()
doesn't define that parameter explicitly, it doesn't appear until partition_pdf()
and even there it's not strictly required because partition_pdf()
has **kwargs
as well. It's the @add_chunking_strategy()
decorator that actually goes looking for that specific name.
It's one of the challenges with taking advantage of the **kwargs
behaviors, you lose type checking and then bugs like this one can find a home. I think the way I would solve this if we had the engineering will for it would be to create a PartitionOptions
object that can be created at the earliest opportunity (by first receiver of call) and passed around whole from there on. I'd have to noodle it some more but for now I think we just have to be very diligent about arg names.
@awalker4 how does the versioning work for this repo? Should I be bumping the version to 0.0.62-dev0
in prepline_general/api/app.py
and add my CHANGELOG
entry under that new heading?
If you can help me understand how this ends up appearing in production after merging, that would help me a lot so I can let the end-users know what to expect :)
@scanny The version change is driven by the changelog - so you can add a section there under 0.0.63-dev0. Running make version-sync
will then replace the 0.0.62 string in the code. The endpoint will then live at /general/v0/general
and /general/v0.0.63/general
. Will dm you with deployment details.
This is the minimal fix for the
combine_text_under_n_chars
arg-name typo bug described in #337.Chunking parameters were added to the API about 4 months ago in this commit: https://github.com/Unstructured-IO/unstructured-api/commit/49fe71026e026530276bbb72f6a9df46feb6d1c2
At that time, the identifier
combine_under_n_chars
was used instead ofcombine_text_under_n_chars
(as that parameter appears inchunk_by_title()
). As a result, any value supplied for this argument via the REST API is ignored and defaults to 500. Thecombine_under_n_chars
parameter name appears in the API documentation and elsewhere (README.md etc.) and differs from the parameter name documented forchunk_by_title()
.A fuller remedy will be to add a separate
combine_text_under_n_chars
parameter to the API endpoint such that either name is accepted for now, then update the documentation such that all mentions arecombine_text_under_n_chars
. At some point in the future we may remove thecombine_under_n_chars
parameter from the endpoint.However, making those changes requires a synchronized release of the
unstructured
library and in the interest of time we're making this "quick-fix" that can be released independently of any other dependencies.