DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License

Add Parallelization Support to `convert_all()` Function with `num_worker` Parameter #369

Open naufalso opened 3 days ago

naufalso commented 3 days ago

Requested feature

I propose adding a parallelization option to the convert_all() function by introducing an additional parameter, such as num_worker. This would let users specify how many workers process conversions concurrently, which should reduce total conversion time for large document sets.

Currently, convert_all() returns an iterator and processes documents sequentially, one at a time. This can be slow when converting a large number of documents. Parallelization would enable faster processing and better utilization of multi-core systems.

Proposed changes:

Example usage:

results = converter.convert_all(source, num_worker=4)

Alternatives

  1. Users can implement parallelization manually by creating one DocumentConverter instance per worker and invoking convert() from custom multiprocessing code. However, this requires additional effort and knowledge, which could be avoided by integrating the feature directly into the library.
  2. Continue using the current sequential approach, which may be acceptable for small datasets but is inefficient for larger ones.
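
For reference, the manual approach in alternative 1 might look roughly like the sketch below. It is not part of docling: `convert_all_parallel`, `make_converter`, and `executor_cls` are hypothetical names used only for illustration. It assumes `make_converter` is a top-level (picklable) callable returning an object with a `convert(source)` method, and that the returned results can be pickled back to the parent process.

```python
import threading
from concurrent.futures import ProcessPoolExecutor

# Per-worker storage: each worker process (or thread) keeps its own converter.
_worker_state = threading.local()


def _init_worker(make_converter):
    # Runs once per worker, so the converter is built once, not once per document.
    _worker_state.converter = make_converter()


def _convert_one(source):
    # Runs inside a worker and reuses that worker's converter instance.
    return _worker_state.converter.convert(source)


def convert_all_parallel(sources, make_converter, num_worker=4,
                         executor_cls=ProcessPoolExecutor):
    """Hypothetical helper: yield one result per source, in input order.

    make_converter: zero-argument callable that builds a converter
                    (assumption: must be picklable when using processes).
    num_worker:     number of concurrent workers, as named in this proposal.
    executor_cls:   executor type; processes by default.
    """
    with executor_cls(max_workers=num_worker,
                      initializer=_init_worker,
                      initargs=(make_converter,)) as pool:
        # Executor.map preserves the order of `sources`.
        yield from pool.map(_convert_one, sources)
```

With docling installed, `make_converter` could be a small top-level function that returns `DocumentConverter()`, and the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. If the conversion results turn out not to be picklable, one workaround would be to have `_convert_one` return an exported string (for example Markdown) instead of the full result object.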

cau-git commented 3 days ago

@naufalso Thanks for your input. Please see the following two discussions to understand our take on this:

naufalso commented 2 days ago

Thank you @cau-git for the clarification and for pointing me to the relevant discussions.

I now see that this feature has already been considered and defined in the roadmap. I truly appreciate the team's great work on this project, and I look forward to the upcoming updates.

Please don't hesitate to reach out if there's any way I can contribute further.

Keep up the fantastic work!