This PR adds support for parsing all list type parameters, including extract_image_block_types, when the unstructured API is called via the unstructured client SDKs (Python/JS) generated by speakeasy.
Currently, speakeasy does not generate proper client code for passing list type parameters to the unstructured API, because it does not expect to support client code specific to FastAPI, which the unstructured API relies on. To address this, I updated the unstructured API code to parse all list type parameters passed as JSON-formatted lists (e.g. '["image", "table"]').
NOTE: You must pass each list type parameter as a JSON-formatted list when calling the unstructured API via the unstructured client SDK
(e.g. extract_image_block_types = '["image", "table"]', skip_infer_table_types='["docx", "xlsx"]', ...).
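As a rough illustration of the parsing described above, here is a minimal sketch of normalizing a list type parameter that may arrive as a JSON-formatted string, a list of form values, or a single plain value. The helper name `parse_list_param` is hypothetical and this is not the actual SmartValueParser code; it only shows the general idea under those assumptions.

```python
import json
from typing import List, Union


def parse_list_param(value: Union[None, str, List[str]]) -> List[str]:
    """Hypothetical helper: normalize a list-type form parameter to a list of strings."""
    if value is None:
        return []
    if isinstance(value, list):
        # A single element may itself hold JSON text, e.g. ['["image", "table"]']
        if len(value) == 1:
            return parse_list_param(value[0])
        return [str(v) for v in value]
    text = value.strip()
    if text.startswith("["):
        try:
            decoded = json.loads(text)
        except json.JSONDecodeError:
            # Not valid JSON: treat the raw string as a single value
            return [value]
        if isinstance(decoded, list):
            return [str(v) for v in decoded]
    # Plain scalar such as "docx"
    return [value]


print(parse_list_param('["image", "table"]'))
print(parse_list_param(["Image", "Table"]))
```

Falling back to the raw value when the text is not valid JSON keeps plain, non-JSON inputs working, which matters for callers that do not go through the SDK.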
Summary
update SmartValueParser.value_or_first_element() to parse a JSON-formatted string (e.g. '["image", "table"]') that is convertible to a list
apply SmartValueParser.value_or_first_element() to all list type parameters
extract_image_block_types parsing logic
Testing
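- via unstructured_client_sdk (Python)
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Client setup is assumed here; it mirrors the JS example (local server, placeholder key)
s = UnstructuredClient(api_key_auth="YOUR-API-KEY", server_url="http://localhost:8000")

filename = "sample-docs/embedded-images-tables.pdf"
with open(filename, "rb") as f:
    # Note that this currently only supports a single file
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    # Other partition params (assumed to mirror the JS example)
    strategy="hi_res",
    extract_image_block_types='["image", "table"]',
)

try:
    resp = s.general.partition(req)
    # Print the MIME type of every element that carries an extracted image
    print([el.get("metadata").get("image_mime_type") for el in resp.elements if el.get("metadata").get("image_mime_type")])
except SDKError as e:
    print(e)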
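Since the SDK expects list type parameters as JSON-formatted strings, it may be easier to build them from a Python list than to write the string by hand. A minimal sketch, assuming a hypothetical helper named `to_json_list` (not part of the SDK):

```python
import json
from typing import Iterable


def to_json_list(values: Iterable[str]) -> str:
    """Hypothetical helper: encode values as a JSON-formatted list string."""
    return json.dumps(list(values))


extract_image_block_types = to_json_list(["image", "table"])
print(extract_image_block_types)  # prints ["image", "table"]
```

The resulting string can be passed directly as the extract_image_block_types or skip_infer_table_types value.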
- via unstructured_client_sdk (JS)
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";

const key = "YOUR-API-KEY";

const client = new UnstructuredClient({
  serverURL: "http://localhost:8000",
  security: {
    apiKeyAuth: key,
  },
});

const filename = "sample-docs/embedded-images-tables.pdf";
const data = fs.readFileSync(filename);

client.general.partition({
  // Note that this currently only supports a single file
  files: {
    content: data,
    fileName: filename,
  },
  // Other partition params
  strategy: "hi_res",
  extractImageBlockTypes: '["image", "table"]',
}).then((res) => {
  if (res.statusCode == 200) {
    console.log(res.elements);
  }
}).catch((e) => {
  console.log(e.statusCode);
  console.log(e.body);
});
- via requests (Python)
import requests

url = "http://localhost:8000/general/v0/general"
headers = {
    "accept": "application/json",
    "unstructured-api-key": "YOUR-API-KEY",
}
# A plain Python list is sent as repeated form fields, not as a JSON string
data = {
    "strategy": "hi_res",
    "extract_image_block_types": ["Image", "Table"],
}

filename = "sample-docs/embedded-images-tables.pdf"
# The context manager closes the file once the request completes
with open(filename, "rb") as f:
    response = requests.post(url, headers=headers, data=data, files={"files": f})

elements = response.json()
print([el.get("metadata").get("image_mime_type") for el in elements if el.get("metadata").get("image_mime_type")])
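The check used in these tests, collecting image_mime_type from each element's metadata, can be factored into a small function that also tolerates elements without a metadata field. This is only a sketch: the name `image_mime_types` and the sample elements are hypothetical and are not real API output.

```python
from typing import Any, Dict, List


def image_mime_types(elements: List[Dict[str, Any]]) -> List[str]:
    """Collect image_mime_type values, skipping elements that have none."""
    found = []
    for el in elements:
        # Tolerate a missing or null metadata field
        metadata = el.get("metadata") or {}
        mime = metadata.get("image_mime_type")
        if mime:
            found.append(mime)
    return found


# Hypothetical sample data for illustration only
sample = [
    {"type": "Image", "metadata": {"image_mime_type": "image/jpeg"}},
    {"type": "NarrativeText", "metadata": {}},
    {"type": "Title"},
]
print(image_mime_types(sample))  # prints ['image/jpeg']
```

A non-empty result suggests that extract_image_block_types was parsed and applied by the server.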
- via curl
curl -X 'POST' \
  'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/embedded-images-tables.pdf' \
  -F 'strategy=hi_res' \
  -F 'extract_image_block_types=["image", "table"]' \
  | jq -C . | less -R