This PR enables the Python and JS clients to partition PDF pages independently after splitting them on the client side (split_pdf_page=True). Splitting is also supported by the API itself, which is useful when users send requests without our dedicated clients.
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

# -- this file is included in this PR --
filename = "sample-docs/DA-1p-with-duplicate-pages.pdf"
with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=False,  # this forces splitting on API side (if parallelization is enabled)
    # split_pdf_page=True,  # forces client-side splitting, implemented here: https://github.com/Unstructured-IO/unstructured-js-client/pull/55
)
resp = s.general.partition(req)

ids = [e["element_id"] for e in resp.elements]
page_numbers = [e["metadata"]["page_number"] for e in resp.elements]

# this PDF contains 3 identical pages, 13 elements each
assert page_numbers == [1,1,1,1,1,1,1,1,1,1,1,1,1,
                        2,2,2,2,2,2,2,2,2,2,2,2,2,
                        3,3,3,3,3,3,3,3,3,3,3,3,3]
assert len(ids) == len(set(ids)), "Element IDs are not unique"
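The two assertions in the script above can be generalized into a small helper that reports what went wrong instead of only failing. This is a minimal sketch, not part of the PR: the name check_partition_result and the fake_elements data are hypothetical, and the only assumption about the response is the element shape the script already relies on (an "element_id" key and a "metadata" dict with "page_number").

```python
from collections import Counter


def check_partition_result(elements, expected_pages, elements_per_page):
    """Return a list of problems found in a partition response; empty means OK.

    Hypothetical helper: it assumes every page yields the same number of
    elements, which holds for the sample PDF with identical pages.
    """
    problems = []

    # Page numbers should run 1..expected_pages, each repeated elements_per_page times.
    expected = [
        page
        for page in range(1, expected_pages + 1)
        for _ in range(elements_per_page)
    ]
    actual = [e["metadata"]["page_number"] for e in elements]
    if actual != expected:
        problems.append(f"unexpected page numbers: {actual}")

    # Every element ID should appear exactly once, even when pages are identical.
    counts = Counter(e["element_id"] for e in elements)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate element IDs: {duplicates}")

    return problems


# Stand-in data shaped like resp.elements (no running server needed).
fake_elements = [
    {"element_id": f"id-{page}-{i}", "metadata": {"page_number": page}}
    for page in range(1, 4)
    for i in range(13)
]
```

With a real response, the check would read: assert not check_partition_result(resp.elements, 3, 13). Returning a list of messages rather than raising on the first failure makes it easier to see whether a splitting bug affects page numbers, element IDs, or both.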
Related to:
It should be merged before these:
The tests for this PR won't pass until the related PRs are both merged.
How to test it locally
Unfortunately, the pytest test is not fully implemented and currently fails; see this comment.
cd unstructured-client; pip install --editable .
make run-web-app
python <script-below>.py