Unstructured-IO / unstructured-api

Apache License 2.0

Frequent 500 errors when enabling parallel mode #357

Closed tylermaran closed 5 months ago

tylermaran commented 6 months ago

Describe the bug

I'm getting consistent 500 errors when I enable UNSTRUCTURED_PARALLEL_MODE_ENABLED. Not every request fails, but about 50% of documents fail when it is enabled. I'm sending about 60 PDFs over for parsing (over a ~10 second window). There is a mix of high-quality PDFs (no OCR needed) and documents that need to be OCR'd.

Running the same number of documents through without parallel mode succeeds, but some of the OCR documents time out after a couple of minutes. When I test parallel mode with a single OCR document, it's roughly 3x faster. But when I try to push a bit more volume through, it breaks.

To Reproduce

Make about 50 requests with parallel mode enabled and using the auto strategy.

# ENV Vars
UNSTRUCTURED_PARALLEL_MODE_ENABLED=true
UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE=1
UNSTRUCTURED_PARALLEL_MODE_THREADS=3
UNSTRUCTURED_PARALLEL_MODE_URL=[my_domain]/general/v0/general
UNSTRUCTURED_PARALLEL_RETRY_ATTEMPTS=2
// Request logs
input data: {"content_type": "application/pdf", "strategy": "auto", "ocr_languages": null, "coordinates": false, "pdf_infer_table_structure": false, "include_page_breaks": false, "encoding": null, "hi_res_model_name": null, "xml_keep_tags": false, "skip_infer_table_types": "pdf", "languages": null, "chunking_strategy": null, "multipage_sections": true, "combine_under_n_chars": null, "new_after_n_chars": null, "max_characters": 500, "extract_image_block_types": null, "extract_image_block_to_payload": false}

Environment:

# AWS Settings
t3.large machine
autoscaling up to 10 instances
2gb ram per instance
nginx timeouts set to 120s

Additional context

Logs I'm getting back. Just a mix of 500s in with the successful requests.

2024-01-27 19:10:11,894 unstructured INFO Processing entire page OCR with tesseract...
--
01/27 11:10:12 | unstructured-api | Version: 11 | 2024-01-27 19:10:12,049 unstructured INFO Processing entire page OCR with tesseract...
01/27 11:10:13 | unstructured-api | Version: 11 | 2024-01-27 19:10:13,618 10.78.164.150:44198 POST /general/v0/general HTTP/1.1 - 200 OK
01/27 11:10:13 | unstructured-api | Version: 11 | 2024-01-27 19:10:13,643 unstructured_api ERROR Expecting value: line 1 column 1 (char 0)
01/27 11:10:13 | unstructured-api | Version: 11 | 2024-01-27 19:10:13,647 10.78.164.150:34572 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,440 10.78.164.150:53524 POST /general/v0/general HTTP/1.1 - 200 OK
01/27 11:10:14 | unstructured-api | Version: 11 | scripts/app-start.sh: line 5:     7 Killed                  uvicorn prepline_general.api.app:app --log-config logger_config.yaml --host 0.0.0.0
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,863 unstructured_api ERROR Expecting value: line 1 column 1 (char 0)
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,864 10.78.82.83:52172 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,949 10.78.164.150:43358 POST /general/v0/general HTTP/1.1 - 200 OK
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,963 unstructured_api ERROR Expecting value: line 1 column 1 (char 0)
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,964 10.78.82.83:60494 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
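
For reference, the "Expecting value: line 1 column 1 (char 0)" message in these logs is the text Python's json module produces when it is asked to decode an empty (or otherwise non-JSON) string. A minimal sketch follows, assuming (the logs do not confirm this) that the API is trying to parse an empty response body from a worker that was killed mid-request. The helper name try_parse is made up for illustration.

```python
import json


def try_parse(body: str):
    """Return the decoded JSON, or the decoder's error message on failure.

    try_parse is a hypothetical helper, not part of unstructured-api.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        return str(exc)


# An empty body reproduces the message seen in the logs.
print(try_parse(""))
# A valid JSON body decodes normally.
print(try_parse('[{"type": "Title", "text": "Hello"}]'))
```

If that assumption holds, the 500s would be a symptom of the "Killed" line in the same log window rather than a separate bug.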
awalker4 commented 5 months ago

Hi there, I'm unable to reproduce locally but I do have some thoughts.

Hope this helps!

tylermaran commented 5 months ago

Appreciate the info. Yeah, it's likely an OOM issue. I already have some retry logic on my API side, so I'll test it out with the UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB var to see when it's running out of memory. In general I've stepped everything up to 16 GB and it hasn't thrown as many errors.

Right now I'm just deploying the latest Docker image. But I'll do some local testing and try to trigger the error with additional logs. Probably fine to close the issue for now and I'll add more info if I can reproduce locally.
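
A minimal sketch of the kind of client-side retry logic mentioned above, assuming the caller treats any 5xx response as retryable and backs off exponentially so an overloaded instance has time to free memory. The function post_with_retry and its send parameter are hypothetical names for illustration. In practice send would wrap whatever HTTP client posts the file to /general/v0/general.

```python
import time
from typing import Callable, Tuple

# Hypothetical transport: any zero-argument callable returning (status_code, body).
Sender = Callable[[], Tuple[int, str]]


def post_with_retry(
    send: Sender,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-5xx status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status < 500:
            # Success or a client error: retrying would not help.
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    # All attempts failed; hand the last server error back to the caller.
    return status, body
```

The sleep parameter is injectable only so the backoff can be exercised in tests without actually waiting. Retries like this mask the failures but do not fix the underlying memory pressure, so they work best alongside the larger instances described above.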

awalker4 commented 5 months ago

Sounds good! Lmk how it goes.