Closed omikader closed 4 months ago
Seems like the answer is yes! I just tested the same scenario but with `hi_res` and saw that parallel mode took 19 minutes compared to 41 minutes without it. It appears that the overhead of the file splitting/consolidation hurts performance for the `fast` strategy.
Leaving this issue open just in case it helps someone else and leads to more explicit guidance in the README (e.g. "This mode is only recommended when using the `hi_res` strategy").
Hi @awalker4! Would love to get your take on this, if possible. Thank you!
Hi @omikader, sorry for the delay. Correct - parallel mode will have a huge speedup for hi_res, but will otherwise just add overhead. The library code for hi_res is all serialized, and we needed a way to split up the work without redesigning the whole library. The approach of splitting the file and sending out another batch of API requests was a simple way to get the load balancer to do the scaling for us. Since hi_res PDFs are so CPU-heavy, this unlocks a huge speedup (it's all those Tesseract calls!). Any other filetype/strategy will be done long before the PDF gets split up.
Thanks for calling this out - I'll add a note to the README. Or, if you have a moment, a PR would be a huge help :)
Also note that we're pushing to do pdf splitting on the client these days - we don't actually have parallel mode set on our server anymore. We've basically reimplemented this logic in the python client.
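For anyone landing here later, the split-and-fan-out idea described above can be sketched roughly as follows. This is a minimal illustration, not the server's or the Python client's actual code: `split_pages`, `partition_in_parallel`, and `fake_partition_chunk` are hypothetical names, and the per-chunk function stands in for whatever request would be sent for each slice of the PDF.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Inclusive, 1-indexed (first_page, last_page) range. Hypothetical helper type.
PageRange = Tuple[int, int]


def split_pages(total_pages: int, pages_per_chunk: int) -> List[PageRange]:
    """Divide pages 1..total_pages into consecutive inclusive ranges."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]


def partition_in_parallel(
    total_pages: int,
    pages_per_chunk: int,
    partition_chunk: Callable[[PageRange], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Process each page range concurrently and merge results in page order."""
    chunks = split_pages(total_pages, pages_per_chunk)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the merged
        # output stays in page order even if chunks finish out of order.
        per_chunk = list(pool.map(partition_chunk, chunks))
    return [element for chunk in per_chunk for element in chunk]


def fake_partition_chunk(page_range: PageRange) -> List[str]:
    """Stand-in for a real per-chunk request (assumption, for illustration)."""
    first, last = page_range
    return [f"page-{n}" for n in range(first, last + 1)]


if __name__ == "__main__":
    print(split_pages(10, 4))
    print(partition_in_parallel(5, 2, fake_partition_chunk))
```

The split and the merge are fixed costs paid on every request. With hi_res, the per-chunk work is large enough to hide them; with a strategy where each chunk finishes almost immediately, those fixed costs are most of the total time, which matches the slowdown reported in this issue.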
@awalker4 done! See https://github.com/Unstructured-IO/unstructured-api/pull/395.
And thanks for the tip about client-side splitting! I'm currently using the JS client so looking forward to seeing that supported over there soon 🙂
Describe the bug
I'm playing around with parallel mode and the `fast` strategy, and I was surprised to notice that it took longer to partition my PDF. Is this expected? Is parallel mode only recommended when using `hi_res` mode?
To Reproduce
strategy: 'fast'
chunking_strategy: 'by_title'
Environment:
I'm running `unstructured-api` as a Docker container on my local machine.
All requests are made using cURL.