Add Parallelization Support to `convert_all()` Function with `num_worker` Parameter

naufalso commented 3 days ago

Requested feature

I propose adding a parallelization option to `convert_all()` through a new parameter such as `num_worker`. Users could set the number of workers that process conversions concurrently, which should speed up conversion of large document sets.

Currently, `convert_all()` returns an iterator and converts documents sequentially, one at a time. This can be slow for a large number of documents. Parallelization would enable faster processing and better use of multi-core systems.

Proposed changes:

Add a `num_worker` parameter to the `convert_all()` function.
Modify the function to use a parallel execution library (e.g., `concurrent.futures` or `multiprocessing`) to handle multiple conversion tasks simultaneously.

Example usage:

results = converter.convert_all(source, num_worker=4)
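
To make the proposal concrete, here is a minimal sketch of how a `num_worker` option could behave, using `concurrent.futures` from the standard library. It is not docling's implementation: `convert_all_parallel` and `fake_convert` are hypothetical names, and the conversion step is replaced by a stand-in so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

S = TypeVar("S")
R = TypeVar("R")


def convert_all_parallel(
    convert: Callable[[S], R],
    sources: Iterable[S],
    num_worker: int = 1,
) -> Iterator[R]:
    """Yield convert(source) for every source, in input order.

    Hypothetical helper: `convert` stands in for a single-document
    conversion call and is an assumption, not a docling API.
    """
    if num_worker <= 1:
        # Sequential path, equivalent to the current behaviour.
        for source in sources:
            yield convert(source)
        return

    with ThreadPoolExecutor(max_workers=num_worker) as pool:
        # Executor.map submits all tasks up front and yields the
        # results in the same order as the inputs.
        yield from pool.map(convert, sources)


def fake_convert(path: str) -> str:
    # Stand-in for a real document conversion.
    return path.upper()


results = list(
    convert_all_parallel(fake_convert, ["a.pdf", "b.pdf", "c.pdf"], num_worker=4)
)
```

The sketch uses threads for simplicity. For CPU-bound conversion a process pool may be the better fit, at the cost of requiring picklable inputs and results.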

Alternatives

Users can implement parallelization manually by creating one Document Converter instance per worker and invoking `convert()` from custom multiprocessing code. However, this takes extra effort and knowledge that could be avoided by integrating the feature directly into the library.
Users can continue with the current sequential approach, which may be acceptable for small datasets but is inefficient for larger ones.
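
The manual alternative might look roughly like the sketch below, where each worker builds its own converter once and reuses it. `FakeConverter` is a hypothetical stand-in for the real converter class, and threads are used here instead of processes only to keep the example short and runnable. `ProcessPoolExecutor` accepts the same `initializer` argument, so a process-based version would follow the same shape.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


class FakeConverter:
    """Hypothetical stand-in for a converter that is costly to construct."""

    def convert(self, source: str) -> str:
        return f"converted:{source}"


# Per-worker storage: each worker thread sees its own attributes.
_worker_state = threading.local()


def _init_worker() -> None:
    # Runs once in each worker thread and builds that worker's converter.
    _worker_state.converter = FakeConverter()


def _convert_one(source: str) -> str:
    # Reuses the converter created for the current worker.
    return _worker_state.converter.convert(source)


def convert_with_worker_instances(
    sources: Iterable[str], num_worker: int = 4
) -> List[str]:
    with ThreadPoolExecutor(
        max_workers=num_worker, initializer=_init_worker
    ) as pool:
        # Results come back in input order.
        return list(pool.map(_convert_one, sources))


outputs = convert_with_worker_instances(["a.pdf", "b.pdf"], num_worker=2)
```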

cau-git commented 3 days ago

@naufalso Thanks for your input. Please see the following two discussions to understand our take on this:

https://github.com/DS4SD/docling/discussions/377

https://github.com/DS4SD/docling/discussions/306

naufalso commented 2 days ago

Thank you @cau-git for the clarification and for pointing me to the relevant discussions.

I now see that this feature has already been considered and defined in the roadmap. I truly appreciate the team's great work on this project, and I look forward to the upcoming updates.

Please don't hesitate to reach out if there's any way I can contribute further.

Keep up the fantastic work!

DS4SD / docling

Add Parallelization Support to `convert_all()` Function with `num_worker` Parameter #369
