Unstructured-IO / unstructured-api


Is enabling "parallel mode" only recommended for `hi_res` strategy? #378

Closed: omikader closed this issue 4 months ago

omikader commented 5 months ago

Describe the bug

I'm playing around with parallel mode and the fast strategy, and I was surprised to see that partitioning my PDF took longer with parallel mode enabled. Is this expected? Is parallel mode only recommended when using the hi_res strategy?

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26.1M  100 3199k  100 23.0M   103k   763k  0:00:30  0:00:30 --:--:--  767k
curl -O -X 'POST' 'http://localhost:8000/general/v0/general' -H  -H  -F  -F    0.01s user 0.03s system 0% cpu 30.936 total
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26.2M  100 3267k  100 23.0M  53379   376k  0:01:02  0:01:02 --:--:--  878k
curl -O -X 'POST' 'http://localhost:8000/general/v0/general' -H  -H  -F  -F    0.01s user 0.03s system 0% cpu 1:02.71 total

To Reproduce

Environment:

I'm running unstructured-api as a Docker container on my local machine

omar@Omars-MacBook-Pro % docker run -p 8000:8000 -d --rm --name unstructured-api \
-e UNSTRUCTURED_PARALLEL_MODE_ENABLED='true' \
-e UNSTRUCTURED_PARALLEL_MODE_URL='http://127.0.0.1:8000/general/v0/general' \
downloads.unstructured.io/unstructured-io/unstructured-api:latest \
--port 8000 --host 0.0.0.0
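
(Side note: the README also documents a few parallel-mode tuning knobs, e.g. thread count and split size. The variable names below are from memory, so verify them against the current README before relying on them.)

docker run -p 8000:8000 -d --rm --name unstructured-api \
  -e UNSTRUCTURED_PARALLEL_MODE_ENABLED='true' \
  -e UNSTRUCTURED_PARALLEL_MODE_URL='http://127.0.0.1:8000/general/v0/general' \
  -e UNSTRUCTURED_PARALLEL_MODE_THREADS='3' \
  -e UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE='1' \
  downloads.unstructured.io/unstructured-io/unstructured-api:latest \
  --port 8000 --host 0.0.0.0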

All requests are made using cURL


omar@Omars-MacBook-Pro % time curl -O -X 'POST' \
  'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@Consolidated Set of Standards.pdf' \
  -F 'chunking_strategy=by_title' \
  -F 'strategy=fast'

omikader commented 5 months ago

Seems like the answer is yes! I just tested the same scenario with hi_res and saw that parallel mode took 19 minutes compared to 41 minutes without it. It appears that the overhead of the file splitting/consolidation hurts performance for the fast strategy.

Leaving this issue open in case it helps someone else and leads to more explicit guidance in the README (e.g. "This mode is only recommended when using the hi_res strategy").
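
For reference, the hi_res runs were the same curl command with the strategy swapped, roughly:

time curl -O -X 'POST' \
  'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@Consolidated Set of Standards.pdf' \
  -F 'chunking_strategy=by_title' \
  -F 'strategy=hi_res'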

omikader commented 4 months ago

Hi @awalker4! Would love to get your take on this, if possible. Thank you!

awalker4 commented 4 months ago

Hi @omikader, sorry for the delay. Correct - parallel mode gives a huge speedup for hi_res, but otherwise it just adds overhead. The library code for hi_res runs serially, and we needed a way to split up the work without redesigning the whole library. Splitting the file and sending out another batch of API requests was a simple way to let the load balancer do the scaling for us. Since hi_res PDFs are so CPU-heavy, this unlocks a huge speedup (it's all those Tesseract calls!). Any other filetype/strategy will be done long before the PDF even gets split up.
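
Very roughly, the idea looks something like the sketch below. This is illustrative only, not our actual implementation; the pypdf/requests usage, chunk size, thread count, and helper names are all made up for the example:

# Illustrative sketch only - not the server's actual implementation. It shows the
# general "split the PDF, fan out requests, merge the JSON" idea using pypdf,
# requests, and a thread pool. Chunk size and thread count are arbitrary.
import io
import json
from concurrent.futures import ThreadPoolExecutor

import requests
from pypdf import PdfReader, PdfWriter

API_URL = "http://localhost:8000/general/v0/general"
PAGES_PER_CHUNK = 10  # arbitrary for the sketch

def split_pdf(path, pages_per_chunk):
    # Split a PDF into in-memory chunks of at most pages_per_chunk pages each.
    reader = PdfReader(path)
    chunks = []
    for start in range(0, len(reader.pages), pages_per_chunk):
        writer = PdfWriter()
        for i in range(start, min(start + pages_per_chunk, len(reader.pages))):
            writer.add_page(reader.pages[i])
        buf = io.BytesIO()
        writer.write(buf)
        chunks.append(buf.getvalue())
    return chunks

def partition_chunk(chunk):
    # Send one chunk to the API with the hi_res strategy and return its elements.
    resp = requests.post(
        API_URL,
        files={"files": ("chunk.pdf", chunk, "application/pdf")},
        data={"strategy": "hi_res"},
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    chunks = split_pdf("Consolidated Set of Standards.pdf", PAGES_PER_CHUNK)
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(partition_chunk, chunks))
    # Concatenate the per-chunk element lists back into one list for the document.
    # (A real implementation would also re-offset the page_number metadata per chunk.)
    elements = [el for chunk_elements in results for el in chunk_elements]
    print(json.dumps(elements[:2], indent=2))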

Thanks for calling this out - I'll add a note to the README. Or, if you have a moment, a PR would be a huge help :)

awalker4 commented 4 months ago

Also note that we're pushing to do PDF splitting on the client these days - we don't actually have parallel mode set on our servers anymore. We've basically reimplemented this logic in the Python client.
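
If you're on the Python SDK, the splitting should be opt-in via a split_pdf_page parameter. The request/parameter classes have changed between SDK versions, so treat the following as a rough sketch rather than canonical usage, and check the client README for the current call shape:

# Rough sketch of client-side splitting with unstructured-client; the request
# classes have changed across SDK versions, so verify against your installed version.
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared

client = UnstructuredClient(api_key_auth="YOUR_API_KEY")

filename = "Consolidated Set of Standards.pdf"
with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    split_pdf_page=True,  # split the PDF client-side and send the pieces concurrently
)

resp = client.general.partition(req)
print(len(resp.elements))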

omikader commented 4 months ago

@awalker4 done! See https://github.com/Unstructured-IO/unstructured-api/pull/395.

And thanks for the tip about client-side splitting! I'm currently using the JS client, so I'm looking forward to seeing that supported over there soon 🙂