Unstructured-IO / unstructured-api

Apache License 2.0

max_characters param not accessible via pipeline_api #297

Closed macahi closed 10 months ago

macahi commented 10 months ago

Describe the bug

Hello,

I am currently testing unstructured-api with chunking_strategy=by_title. I noticed that the max_characters parameter of the chunk_by_title method cannot be passed via pipeline_api: https://github.com/Unstructured-IO/unstructured-api/blob/c91d1b9966f5281344fe5d2e662b94ea3aa2e46d/prepline_general/api/general.py#L302 As a result, new_after_n_chars cannot be set to a value that exceeds the default max_characters (500).

To Reproduce

curl -X 'POST' \
 'https://api.unstructured.io/general/v0/general' \
 -H 'accept: application/json' \
 -H 'Content-Type: multipart/form-data' \
 -F 'files=@sample-docs/layout-parser-paper-fast.pdf' \
 -F 'chunking_strategy=by_title' \
 -F 'new_after_n_chars=1500'

new_after_n_chars has no effect; the maximum chunk size stays at 500 characters.
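
To illustrate why the soft limit appears to be ignored, here is a minimal toy sketch in Python. It is not the unstructured implementation: the function chunk_texts and its clamping rule are hypothetical, written under the assumption that the soft limit (new_after_n_chars) can never take effect above the hard limit (max_characters), which is what the behavior above suggests.

```python
def chunk_texts(texts, max_characters=500, new_after_n_chars=None):
    """Toy chunker (hypothetical, NOT the unstructured library's code).

    max_characters is a hard cap on chunk length; new_after_n_chars is a
    soft limit after which the current chunk is closed.
    """
    # Assumption: a soft limit above the hard limit is clamped to it.
    if new_after_n_chars is None:
        soft_limit = max_characters
    else:
        soft_limit = min(new_after_n_chars, max_characters)

    chunks = []
    current = ""
    for text in texts:
        if not text:
            continue
        # Split any oversized text so no single piece exceeds the hard cap.
        pieces = [text[i:i + max_characters]
                  for i in range(0, len(text), max_characters)]
        for piece in pieces:
            candidate = current + " " + piece if current else piece
            if current and len(candidate) > max_characters:
                # Adding this piece would break the hard cap: close the chunk.
                chunks.append(current)
                current = piece
            else:
                current = candidate
            if len(current) >= soft_limit:
                # Soft limit reached: close the chunk early.
                chunks.append(current)
                current = ""
    if current:
        chunks.append(current)
    return chunks


paragraphs = ["x" * 300] * 6

# Only the soft limit is raised, as in the curl request above.
capped = chunk_texts(paragraphs, max_characters=500, new_after_n_chars=1500)
print([len(c) for c in capped])

# Both limits are raised, which requires max_characters to be settable.
raised = chunk_texts(paragraphs, max_characters=1500, new_after_n_chars=1500)
print([len(c) for c in raised])
```

In this sketch, larger chunks only appear once max_characters itself is raised, which is why exposing that parameter through the API matters.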

If I'm correct, it should be easy to fix.

awalker4 commented 10 months ago

Hi there, this will certainly be a quick fix. We'll keep you posted!

awalker4 commented 10 months ago

This is now deployed in the hosted API.