Unstructured-IO / unstructured-api


feat: support list type parameters #368

Closed christinestraub closed 7 months ago

christinestraub commented 7 months ago

This PR adds support for parsing all list-type parameters, including `extract_image_block_types`, when the unstructured API is called through the unstructured client SDKs (Python/JS) generated by Speakeasy.

Currently, Speakeasy does not generate client code that passes list-type parameters in a form the unstructured API accepts, because Speakeasy does not plan to support client code specific to FastAPI, which the unstructured API relies on. To address this, I updated the unstructured API code to parse all list-type parameters passed as JSON-formatted lists (e.g. `'["image", "table"]'`).
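As a rough illustration of the server-side behavior described above, the sketch below normalizes a list-type form parameter that may arrive either as repeated form fields or as a single JSON-formatted string. The helper name `parse_list_param` is hypothetical and this is not the code from the PR; it only assumes the form value reaches the handler as a list of strings.

```python
import json
from typing import List, Optional


def parse_list_param(values: Optional[List[str]]) -> List[str]:
    """Normalize a list-type form parameter (hypothetical helper, not the PR's code).

    Accepts either repeated form fields, e.g. ["image", "table"], or a single
    field holding a JSON-formatted list, e.g. ['["image", "table"]'].
    """
    if not values:
        return []
    if len(values) == 1:
        raw = values[0].strip()
        if raw.startswith("[") and raw.endswith("]"):
            try:
                parsed = json.loads(raw)
            except json.JSONDecodeError:
                # Not valid JSON: fall through and treat it as a plain value.
                pass
            else:
                if isinstance(parsed, list):
                    return [str(item) for item in parsed]
    return list(values)


print(parse_list_param(['["image", "table"]']))  # ['image', 'table']
print(parse_list_param(["image", "table"]))      # ['image', 'table']
```

Both call styles yield the same Python list, so SDK callers and plain HTTP clients can share one code path after this step.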

NOTE: When calling the unstructured API through the unstructured client SDK, you must pass each list-type parameter as a JSON-formatted list (e.g. `extract_image_block_types='["image", "table"]'`, `skip_infer_table_types='["docx", "xlsx"]'`).
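On the client side, one way to avoid hand-writing the JSON string is to build it with the standard `json` module. This is a minimal sketch; `encode_list_param` is a hypothetical helper and not part of the SDK.

```python
import json
from typing import Iterable


def encode_list_param(values: Iterable[str]) -> str:
    """Serialize a list of strings as a JSON-formatted list (hypothetical helper)."""
    return json.dumps(list(values))


params = {
    "extract_image_block_types": encode_list_param(["image", "table"]),
    "skip_infer_table_types": encode_list_param(["docx", "xlsx"]),
}
print(params["extract_image_block_types"])  # ["image", "table"]
print(params["skip_infer_table_types"])     # ["docx", "xlsx"]
```

The resulting strings have the same shape as the values shown in the note above and can be passed directly as SDK parameters.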

Summary

Testing

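- via unstructured_client_sdk (Python)

    from unstructured_client import UnstructuredClient
    from unstructured_client.models import shared
    from unstructured_client.models.errors import SDKError

    # Client setup is not shown in the original snippet; the values here are
    # assumed to match the JS example (local server, placeholder API key).
    s = UnstructuredClient(
        api_key_auth="YOUR-API-KEY",
        server_url="http://localhost:8000",
    )

    filename = "sample-docs/embedded-images-tables.pdf"

    with open(filename, "rb") as f:
        # Note that this currently only supports a single file
        files = shared.Files(
            content=f.read(),
            file_name=filename,
        )

    req = shared.PartitionParameters(
        files=files,
        # Other partition params
        strategy="hi_res",
        extract_image_block_types='["image", "table"]',
        languages=["pdf"],
    )

    try:
        resp = s.general.partition(req)
        # Print the MIME type of every element that carries extracted image data.
        print([el.get("metadata").get("image_mime_type") for el in resp.elements if el.get("metadata").get("image_mime_type")])
    except SDKError as e:
        print(e)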


- via unstructured_client_sdk (JS)

    import { UnstructuredClient } from "unstructured-client";
    import * as fs from "fs";

    const key = "YOUR-API-KEY";

    const client = new UnstructuredClient({
        serverURL: "http://localhost:8000",
        security: {
            apiKeyAuth: key,
        },
    });

    const filename = "sample-docs/embedded-images-tables.pdf";
    const data = fs.readFileSync(filename);

    client.general.partition({
        // Note that this currently only supports a single file
        files: {
            content: data,
            fileName: filename,
        },
        // Other partition params
        strategy: "hi_res",
        extractImageBlockTypes: '["image", "table"]',
    }).then((res) => {
        if (res.statusCode === 200) {
            console.log(res.elements);
        }
    }).catch((e) => {
        console.log(e.statusCode);
        console.log(e.body);
    });

- via default `requests` client (Python)

    import requests

    url = "http://localhost:8000/general/v0/general"

    headers = {
        "accept": "application/json",
        "unstructured-api-key": "YOUR-API-KEY",
    }

    # requests sends a list value as repeated form fields, so no JSON encoding is needed here.
    data = {
        "strategy": "hi_res",
        "extract_image_block_types": ["Image", "Table"],
    }

    filename = "sample-docs/embedded-images-tables.pdf"

    # The context manager closes the file once the request completes.
    with open(filename, "rb") as f:
        response = requests.post(url, headers=headers, data=data, files={"files": f})

    elements = response.json()
    print([el.get("metadata").get("image_mime_type") for el in elements if el.get("metadata").get("image_mime_type")])
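The check used in the Python examples above (printing `image_mime_type` for elements that have one) can be wrapped in a small helper that also tolerates elements without a `metadata` field. This is a sketch only: `image_mime_types` is a hypothetical name, and the sample elements are made up for illustration rather than taken from a real response.

```python
from typing import Any, Dict, List


def image_mime_types(elements: List[Dict[str, Any]]) -> List[str]:
    """Collect image_mime_type from elements whose metadata includes one."""
    mime_types = []
    for el in elements:
        metadata = el.get("metadata") or {}  # tolerate missing or null metadata
        mime_type = metadata.get("image_mime_type")
        if mime_type:
            mime_types.append(mime_type)
    return mime_types


# Hypothetical response fragment, for illustration only.
sample = [
    {"type": "Image", "metadata": {"image_mime_type": "image/jpeg"}},
    {"type": "NarrativeText", "metadata": {}},
    {"type": "Title"},
]
print(image_mime_types(sample))  # ['image/jpeg']
```

A non-empty result suggests that `extract_image_block_types` was parsed and applied; an empty list means no element carried extracted image data.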


- via `curl` command

    curl -X 'POST' \
      'http://localhost:8000/general/v0/general' \
      -H 'accept: application/json' \
      -H 'Content-Type: multipart/form-data' \
      -F 'files=@sample-docs/embedded-images-tables.pdf' \
      -F 'strategy=hi_res' \
      -F 'extract_image_block_types=["image", "table"]' \
      | jq -C . | less -R