Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.64k stars, 705 forks

Roman/fix ingest async connectors #3210

Closed by rbiseck3 3 months ago

rbiseck3 commented 3 months ago

Description

Marking a connector as async needs to be done carefully. When a connector is set to use async, the pipeline does not fan out the inputs via multiprocessing. Instead, it is limited to running in a single process, on the assumption that the connector benefits more from async because of heavy network traffic. This means that code which is blocking and not optimized for async will make the pipeline perform worse than if the connector had never been marked async, since in that case the pipeline would fan the work out using multiprocessing.
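To illustrate the concern, here is a minimal standalone sketch. It does not use the `unstructured` ingest API; the names `blocking_task`, `non_blocking_task`, `run_all`, `DELAY`, and `ITEMS` are hypothetical. `time.sleep` stands in for blocking network code and `asyncio.sleep` for properly awaited I/O. The sketch shows that a coroutine which blocks gains nothing from running on an event loop in a single process:

```python
import asyncio
import time

DELAY = 0.2  # simulated latency per item, in seconds (assumed value)
ITEMS = 4    # number of inputs to process (assumed value)


async def blocking_task(_item: int) -> None:
    # Declared async, but time.sleep blocks the event loop,
    # so no other coroutine can make progress in the meantime.
    time.sleep(DELAY)


async def non_blocking_task(_item: int) -> None:
    # Awaiting yields control to the event loop, so the waits overlap.
    await asyncio.sleep(DELAY)


async def run_all(task) -> float:
    # Run one coroutine per item concurrently and return the elapsed time.
    start = time.perf_counter()
    await asyncio.gather(*(task(i) for i in range(ITEMS)))
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(run_all(blocking_task))
async_elapsed = asyncio.run(run_all(non_blocking_task))

print(f"blocking code under async: {blocking_elapsed:.2f}s")
print(f"awaited code under async:  {async_elapsed:.2f}s")
```

In this sketch the blocking version takes roughly `ITEMS * DELAY` because every call runs back to back in the single process, while the awaited version takes roughly `DELAY`. A connector in the first situation would likely be better left unmarked, so that the pipeline can fan it out using multiprocessing.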

All connectors and processes in the pipeline were revisited to make sure this criterion was met, and were updated accordingly: