This PR adds support for parsing all list type parameters, including extract_image_block_types, when the unstructured API is called via the unstructured client SDKs (Python/JS) generated by Speakeasy.

Currently, Speakeasy does not generate client code that passes list type parameters to the unstructured API correctly, because Speakeasy does not plan to support the FastAPI-specific client code that the unstructured API relies on. To address this, I updated the unstructured API code to parse all list type parameters when they are passed as JSON-formatted lists (e.g. '["image", "table"]').

NOTE: When calling the unstructured API via the unstructured client SDK, you must pass each list type parameter as a JSON-formatted list (e.g. extract_image_block_types='["image", "table"]', skip_infer_table_types='["docx", "xlsx"]', ...).
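As a small illustration of the note above, a JSON-formatted list string can be built from an ordinary Python list with the standard library's `json.dumps`, rather than typed by hand. The variable names below are only illustrative; the parameter names are the ones mentioned in this PR.

```python
import json

# Ordinary Python lists holding the values we want to send.
block_types = ["image", "table"]
skip_types = ["docx", "xlsx"]

# Serialize each list to the JSON-formatted string the SDK call expects.
extract_image_block_types = json.dumps(block_types)
skip_infer_table_types = json.dumps(skip_types)

print(extract_image_block_types)  # ["image", "table"]
print(skip_infer_table_types)     # ["docx", "xlsx"]
```

The resulting strings can then be passed as the values of the corresponding list type parameters in the SDK request.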

Summary

- update SmartValueParser.value_or_first_element() to parse a JSON-formatted string (e.g. '["image", "table"]') that is convertible to a list
- apply SmartValueParser.value_or_first_element() to all list type parameters
- remove the existing extract_image_block_types parsing logic
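The parsing behaviour summarized above can be sketched roughly as follows. This is not the actual SmartValueParser code from this PR: `parse_list_param` is a hypothetical standalone helper, and it only assumes that a list type parameter arrives either as several plain values or as a single JSON-formatted list string.

```python
import json
from typing import List, Optional, Union


def parse_list_param(values: Optional[Union[str, List[str]]]) -> List[str]:
    """Hypothetical helper: normalize a list type form parameter to a list of strings."""
    if values is None:
        return []
    if isinstance(values, str):
        values = [values]
    # A single element that looks like a JSON array is decoded into its items.
    if len(values) == 1:
        candidate = values[0].strip()
        if candidate.startswith("[") and candidate.endswith("]"):
            try:
                decoded = json.loads(candidate)
            except json.JSONDecodeError:
                # Not valid JSON after all: keep the raw value untouched.
                return list(values)
            if isinstance(decoded, list):
                return [str(item) for item in decoded]
    # Already a plain list of values (e.g. repeated form fields).
    return list(values)


print(parse_list_param(['["image", "table"]']))
print(parse_list_param(["Image", "Table"]))
```

Falling back to the raw values when decoding fails keeps plain, non-JSON inputs working the same way as before.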

Testing

- via unstructured_client_sdk (Python)

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(
    server_url="http://localhost:8000/general/v0/general",
    api_key_auth="YOUR-API-KEY",
)

filename = "sample-docs/embedded-images-tables.pdf"

with open(filename, "rb") as f:
    # Note that this currently only supports a single file
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    # Other partition params
    strategy="hi_res",
    extract_image_block_types='["image", "table"]',
    languages=["pdf"],
)

try:
    resp = s.general.partition(req)
    # Print the MIME type of every element that carries extracted image data
    print([el.get("metadata").get("image_mime_type") for el in resp.elements if el.get("metadata").get("image_mime_type")])
except SDKError as e:
    print(e)


- via unstructured_client_sdk (JS)

import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";

const key = "YOUR-API-KEY";

const client = new UnstructuredClient({
    serverURL: "http://localhost:8000",
    security: {
        apiKeyAuth: key,
    },
});

const filename = "sample-docs/embedded-images-tables.pdf";
const data = fs.readFileSync(filename);

client.general.partition({
    // Note that this currently only supports a single file
    files: {
        content: data,
        fileName: filename,
    },
    // Other partition params
    strategy: "hi_res",
    extractImageBlockTypes: '["image", "table"]',
}).then((res) => {
    if (res.statusCode == 200) {
        console.log(res.elements);
    }
}).catch((e) => {
    console.log(e.statusCode);
    console.log(e.body);
});

- via default `requests` client (Python)

import requests

url = "http://localhost:8000/general/v0/general"

headers = {
    'accept': 'application/json',
    'unstructured-api-key': "YOUR-API-KEY",
}

# With the plain requests client, list values can be passed as ordinary lists
data = {
    "strategy": "hi_res",
    "extract_image_block_types": ["Image", "Table"],
}

filename = "sample-docs/embedded-images-tables.pdf"
file_data = {'files': open(filename, 'rb')}

response = requests.post(url, headers=headers, data=data, files=file_data)

file_data['files'].close()

elements = response.json()
print([el.get("metadata").get("image_mime_type") for el in elements if el.get("metadata").get("image_mime_type")])


- via `curl` command

curl -X 'POST' \
  'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/embedded-images-tables.pdf' \
  -F 'strategy=hi_res' \
  -F 'extract_image_block_types=["image", "table"]' \
  | jq -C . | less -R

Unstructured-IO / unstructured-api

feat: support list type parameters #368
