Unstructured-IO / unstructured-python-client

A Python client for the Unstructured hosted API
MIT License

bug/v0.25.5 504 Gateway Timeout Error #158

Open JOSHMT0744 opened 3 weeks ago

JOSHMT0744 commented 3 weeks ago

Describe the bug When using v0.25.5 of unstructured-client in VSCode to process PDFs of more than one page with the "hi_res" strategy, I consistently receive INFO: Failed to process a request due to API server error with status code 504. followed by:

INFO: Server message - <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>

To Reproduce

import os
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

os.environ['UNSTRUCTURED_API_KEY'] = "<MY_API_KEY>"
os.environ['UNSTRUCTURED_API_URL'] = "<MY_API_URL>"

client_obj = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

filename = "./data/kenwood_en.pdf"
# Read the PDF inside a context manager so the file handle is closed afterwards
with open(filename, "rb") as file:
    content = file.read()

req = shared.PartitionParameters(
    # Note that this currently only supports a single file
    files=shared.Files(
        content=content,
        file_name=filename,
    ),
    chunking_strategy="by_title",
    max_characters=1024,
    split_pdf_page=True,
    split_pdf_allow_failed=True,
)

try:
    res = client_obj.general.partition(request=req)
    print(res.elements[0])
except SDKError as e:
    print(e)

Expected behavior The PDF should be partitioned and the first element printed. Instead, after about 2 minutes, it always fails with:

INFO: Preparing to split document for partition.
INFO: Starting page number set to 1
INFO: Allow failed set to 1
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 40 (40 total)
INFO: Determined optimal split size of 8 pages.
INFO: Partitioning 5 files with 8 page(s) each.
INFO: Partitioning set #1 (pages 1-8).
INFO: Partitioning set #2 (pages 9-16).
INFO: Partitioning set #3 (pages 17-24).
INFO: Partitioning set #4 (pages 25-32).
INFO: Partitioning set #5 (pages 33-40).
INFO: HTTP Request: POST <MY_API_URL> "HTTP/1.1 504 Gateway Time-out"
ERROR: Failed to send request for page 25
INFO: HTTP Request: POST <MY_API_URL> "HTTP/1.1 504 Gateway Time-out"
ERROR: Failed to send request for page 17
INFO: HTTP Request: POST <MY_API_URL> "HTTP/1.1 504 Gateway Time-out"
ERROR: Failed to send request for page 9
INFO: HTTP Request: POST <MY_API_URL> "HTTP/1.1 504 Gateway Time-out"
ERROR: Failed to send request for page 1
WARNING: Failed to partition set #1, its elements will be omitted in the final result.
WARNING: Failed to partition set #2, its elements will be omitted in the final result.
WARNING: Failed to partition set #3, its elements will be omitted in the final result.
WARNING: Failed to partition set #4, its elements will be omitted in the final result.
WARNING: Failed to partition set #5, its elements will be omitted in the final result.
INFO: Failed to process a request due to API server error with status code 504. Attempting retry number 1 after sleep.
INFO: Server message - <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
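
For reference, the page ranges in the log above are consistent with a simple ceiling division of the page count by the concurrency level. The sketch below only reproduces that arithmetic; `page_ranges` is a hypothetical helper written for illustration, not a function from unstructured-client, and the client's real split-size logic may apply extra limits.

```python
import math


def page_ranges(total_pages: int, concurrency: int) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (first, last) page ranges, one per split.

    Assumption: split size = ceil(total_pages / concurrency). This matches
    the log above (40 pages, concurrency 5, 8 pages per split) but is not
    taken from the client's source.
    """
    split_size = math.ceil(total_pages / concurrency)
    ranges = []
    for first in range(1, total_pages + 1, split_size):
        # The last split may be shorter than split_size
        last = min(first + split_size - 1, total_pages)
        ranges.append((first, last))
    return ranges


print(page_ranges(40, 5))
# [(1, 8), (9, 16), (17, 24), (25, 32), (33, 40)]
```

Each of these ranges is sent as its own POST request, which is why the log reports a separate 504 per set.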

The client then applies its retry strategy, which I presume is the one defined in general.py, and the loop of 504s repeats. I have tried adjusting the RetryConfig both on my client and in the general.partition call, but it makes no difference to when or how my program fails.
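
To make clear what I would expect a retry configuration to control, here is a minimal, library-free sketch of an exponential backoff schedule. The function name, parameters, and values are all hypothetical and are not the client's defaults or API. It only computes the sleeps between attempts (no jitter). As far as I understand, the 504 comes from the gateway (nginx, per the server message), so these settings could change how often the request is retried but not how long each attempt waits on the server.

```python
def backoff_delays(initial_ms: int, max_ms: int, exponent: int, max_elapsed_ms: int) -> list[int]:
    """Return the sleep durations (ms) between retries.

    Each delay grows by `exponent` and is capped at `max_ms`. The schedule
    stops once the next sleep would push the total sleep time past
    `max_elapsed_ms`. All names here are illustrative assumptions.
    """
    delays = []
    elapsed = 0
    attempt = 0
    while True:
        delay = min(initial_ms * exponent ** attempt, max_ms)
        if elapsed + delay > max_elapsed_ms:
            break
        delays.append(delay)
        elapsed += delay
        attempt += 1
    return delays


print(backoff_delays(initial_ms=1000, max_ms=8000, exponent=2, max_elapsed_ms=30000))
# [1000, 2000, 4000, 8000, 8000]
```

With these example values the request would be retried five times, sleeping 23 seconds in total, and every attempt could still end in the same gateway timeout.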

Environment Info I am running this in a Jupyter notebook in VSCode, within a venv.

Additional Info The PDF I used to reproduce this example is here. Would anyone have a solution, or could anyone help me work out whether this is an issue on my end or a bug?