Frequent 500 errors when enabling parallel mode

tylermaran commented 6 months ago

Describe the bug

I'm getting consistent 500 errors when I enable UNSTRUCTURED_PARALLEL_MODE_ENABLED. Not every request fails, but about 50% of documents fail when it is enabled. I'm sending about 60 PDFs for parsing over a ~10 second window. They are a mix of high-quality PDFs (no OCR needed) and documents that need OCR.

Running the same number of documents through without parallel mode succeeds, but some of the OCR documents time out after a couple of minutes. When I test parallel mode with a single OCR document, it's roughly 3x faster, but it breaks when I try a bit more volume.

To Reproduce

Make about 50 requests with parallel mode enabled and using the auto strategy.

# ENV Vars
UNSTRUCTURED_PARALLEL_MODE_ENABLED=true
UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE=1
UNSTRUCTURED_PARALLEL_MODE_THREADS=3
UNSTRUCTURED_PARALLEL_MODE_URL=[my_domain]/general/v0/general
UNSTRUCTURED_PARALLEL_RETRY_ATTEMPTS=2

// Request logs
input data: {"content_type": "application/pdf", "strategy": "auto", "ocr_languages": null, "coordinates": false, "pdf_infer_table_structure": false, "include_page_breaks": false, "encoding": null, "hi_res_model_name": null, "xml_keep_tags": false, "skip_infer_table_types": "pdf", "languages": null, "chunking_strategy": null, "multipage_sections": true, "combine_under_n_chars": null, "new_after_n_chars": null, "max_characters": 500, "extract_image_block_types": null, "extract_image_block_to_payload": false}
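
A minimal sketch of how the reproduction load could be generated and the failure rate measured. Everything here is an assumption for illustration: `send` is a hypothetical callable that would POST one file to the /general/v0/general endpoint with an HTTP client of your choice and return the status code, and `tally_statuses` is not part of unstructured-api.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def tally_statuses(
    send: Callable[[str], int],
    paths: Iterable[str],
    concurrency: int = 10,
) -> Counter:
    """Call send(path) for every path concurrently and count the status codes.

    `send` is a hypothetical helper supplied by the caller; it should upload
    one document and return the HTTP status code of the response.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map runs the uploads on worker threads and yields status codes.
        return Counter(pool.map(send, paths))
```

Comparing the resulting counts with parallel mode on and off should show whether the 500s scale with concurrency or with the share of OCR documents.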

Environment:

Self hosted
Calling the API from my application

# AWS Settings
t3.large machine
autoscaling up to 10 instances
2gb ram per instance
nginx timeouts set to 120s

Additional context

These are the logs I'm getting back: a mix of 500s in with the successful requests.

2024-01-27 19:10:11,894 unstructured INFO Processing entire page OCR with tesseract...
--
01/27 11:10:12 | unstructured-api | Version: 11 | 2024-01-27 19:10:12,049 unstructured INFO Processing entire page OCR with tesseract...
01/27 11:10:13 | unstructured-api | Version: 11 | 2024-01-27 19:10:13,618 10.78.164.150:44198 POST /general/v0/general HTTP/1.1 - 200 OK
01/27 11:10:13 | unstructured-api | Version: 11 | 2024-01-27 19:10:13,643 unstructured_api ERROR Expecting value: line 1 column 1 (char 0)
01/27 11:10:13 | unstructured-api | Version: 11 | 2024-01-27 19:10:13,647 10.78.164.150:34572 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,440 10.78.164.150:53524 POST /general/v0/general HTTP/1.1 - 200 OK
01/27 11:10:14 | unstructured-api | Version: 11 | scripts/app-start.sh: line 5:     7 Killed                  uvicorn prepline_general.api.app:app --log-config logger_config.yaml --host 0.0.0.0
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,863 unstructured_api ERROR Expecting value: line 1 column 1 (char 0)
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,864 10.78.82.83:52172 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,949 10.78.164.150:43358 POST /general/v0/general HTTP/1.1 - 200 OK
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,963 unstructured_api ERROR Expecting value: line 1 column 1 (char 0)
01/27 11:10:14 | unstructured-api | Version: 11 | 2024-01-27 19:10:14,964 10.78.82.83:60494 POST /general/v0/general HTTP/1.1 - 500 Internal Server Error

awalker4 commented 5 months ago

Hi there, I'm unable to reproduce locally but I do have some thoughts.

The Killed message in the logs makes me think OOM may be involved. For docs that need OCR, we suggest a minimum of 16GB of memory. For your mixed workload you may not need this much, but I'd suggest stepping the size up and seeing if this helps. Also note the UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB variable, which you can set to return 503s and help mitigate an OOM kill. This would complement parallel mode's RETRY_ATTEMPTS, and hopefully bounce the request to an instance with less load.
The Expecting value error may come from the "controller" server after the worker has just died. I think I've seen this in rare cases where we try to parse an empty HTTP response. If you're able to modify the code before trying again, perhaps you can add some extra debugging there.
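
To illustrate the second point: Python's standard `json` module produces exactly this message when asked to parse an empty string, which is consistent with an empty response body from a worker that was killed. The `parse_worker_response` helper below is a hypothetical name for illustration, not the function used in unstructured-api; it only sketches the kind of extra check that would make the failure easier to spot.

```python
import json


def parse_worker_response(body: str):
    """Parse a worker's response body, reporting an empty body explicitly.

    Hypothetical helper: the real controller code may be structured differently.
    """
    if not body.strip():
        # An empty body is what the caller would see if the worker died
        # before it could write a response.
        raise ValueError("empty response body from worker (possibly killed)")
    return json.loads(body)


# Parsing an empty string directly reproduces the message seen in the logs.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    message = str(exc)
    print(message)  # Expecting value: line 1 column 1 (char 0)
```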

Hope this helps!

tylermaran commented 5 months ago

Appreciate the info. Yea it's likely an OOM issue. I already have some retry logic on my API side, so I'll test it out with the UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB var to see when it's running out of memory. In general I've stepped everything up to 16GB and it hasn't thrown as many errors.
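
For anyone else landing here, a rough sketch of the kind of client-side retry that pairs with the 503s returned when free memory drops below UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB. This is not my actual code: `post_with_retry` and the `send` callable are hypothetical names, and the backoff values are arbitrary assumptions.

```python
import time
from typing import Callable, Tuple


def post_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns something other than 503, or attempts run out.

    `send` is a hypothetical callable that performs one request and returns
    (status_code, body). Delays double after each failed attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = 0, ""
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            # Success or a non-retryable error: hand it back to the caller.
            return status, body
        if attempt < max_attempts - 1:
            # Back off so the load balancer can route to a less loaded instance.
            sleep(base_delay_s * 2 ** attempt)
    return status, body
```

Only 503 is retried here, so a genuine 500 still surfaces immediately instead of being hidden by the retry loop.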

Right now I'm just deploying the latest Docker image, but I'll do some local testing and try to trigger the error with additional logs. Probably fine to close the issue for now, and I'll add more info if I can reproduce locally.

awalker4 commented 5 months ago

Sounds good! Lmk how it goes.

Unstructured-IO / unstructured-api

Frequent 500 errors when enabling parallel mode #357