Unstructured-IO / unstructured-api

Support for `starting_page_number` parameter when doing PDF page splitting #400

Closed micmarty-deepsense closed 2 months ago

micmarty-deepsense commented 3 months ago

This PR enables the Python and JS clients to partition PDF pages independently after splitting them on the client side (`split_pdf_page=True`). Splitting is also supported by the API itself, which is useful when users send requests without our dedicated clients.
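
To illustrate why a starting page number matters when pages are partitioned independently, here is a minimal sketch. It is not the code from this PR: the helper names `split_ranges` and `apply_starting_page_number` are hypothetical, and elements are modeled as plain dicts with a `metadata.page_number` field, as in the test script further down. The assumption is that each split is numbered from 1 on its own, so its page numbers must be shifted back to document-level values.

```python
from typing import Any


def split_ranges(total_pages: int, pages_per_split: int = 1) -> list[tuple[int, int]]:
    """Return (starting_page_number, page_count) per split; pages are 1-based."""
    if total_pages < 0 or pages_per_split < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_split >= 1")
    return [
        (start, min(pages_per_split, total_pages - start + 1))
        for start in range(1, total_pages + 1, pages_per_split)
    ]


def apply_starting_page_number(
    elements: list[dict[str, Any]], starting_page_number: int
) -> list[dict[str, Any]]:
    """Shift split-local page numbers (which restart at 1) to document-level ones."""
    offset = starting_page_number - 1
    return [
        {
            **element,
            "metadata": {
                **element["metadata"],
                "page_number": element["metadata"]["page_number"] + offset,
            },
        }
        for element in elements
    ]
```

Without the offset, every single-page split would report `page_number == 1`, which is the failure mode the assertions in the test script are meant to catch.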

Related to:

It should be merged before these:

The tests for this PR won't pass until the related PRs are both merged.

How to test it locally

Unfortunately, the pytest test is not fully implemented and currently fails (see this comment).

  1. Clone the Python client and check out this PR: https://github.com/Unstructured-IO/unstructured-js-client/pull/55
  2. cd unstructured-client; pip install --editable .
  3. make run-web-app
  4. python <script-below>.py
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

# -- this file is included in this PR --
filename = "sample-docs/DA-1p-with-duplicate-pages.pdf"
with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=False,  # forces splitting on the API side (if parallelization is enabled)
    # split_pdf_page=True,  # forces client-side splitting, implemented here: https://github.com/Unstructured-IO/unstructured-js-client/pull/55
)
resp = s.general.partition(req)
ids = [e["element_id"] for e in resp.elements]
page_numbers = [e["metadata"]["page_number"] for e in resp.elements]
# this PDF contains 3 identical pages, 13 elements each
assert page_numbers == [1,1,1,1,1,1,1,1,1,1,1,1,1,
                        2,2,2,2,2,2,2,2,2,2,2,2,2,
                        3,3,3,3,3,3,3,3,3,3,3,3,3]
assert len(ids) == len(set(ids)), "Element IDs are not unique"
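
The two checks above can also be expressed with small helpers, which may make failures easier to read. This is only a sketch: `expected_page_numbers` and `duplicate_ids` are hypothetical names, not part of the client or the API.

```python
from collections import Counter


def expected_page_numbers(page_count: int, elements_per_page: int) -> list[int]:
    """Build the page-number sequence for pages that each yield the same element count."""
    return [page for page in range(1, page_count + 1) for _ in range(elements_per_page)]


def duplicate_ids(ids: list[str]) -> list[str]:
    """Return the element IDs that occur more than once, sorted for stable output."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)
```

With these, the assertions could read `assert page_numbers == expected_page_numbers(3, 13)` and `assert not duplicate_ids(ids)`, so that a failure reports the offending IDs.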