dpc / pariter

Parallel iterator processing library for Rust
https://docs.rs/dpc-pariter
101 stars 3 forks source link

Unordered outputs #12

Closed Yomguithereal closed 9 months ago

Yomguithereal commented 9 months ago

Hi @dpc, thank you very much for your library. I use it to perform parallel work over rows of very large CSV files without needing to buffer the whole file into memory or chunk it.

I understand that emitting the results in the same order as the consumed iterator is a sensible default, but it also requires a hashmap and some computation time. Also, sometimes, some result might lag behind (e.g. when performing network requests), and this can either, depending on the implementation, cost a lot of memory or slow everything down.

Would it be of any value to you if I contributed a PR adding a way to allow results to be emitted in arbitrary order, as soon as they are received? I guess this could be an option of the custom builders, rather than specific methods suffixed _unordered since there are already many suffixes already and this would increase the matrix of available choices (normal, custom, scoped etc.).

I wish you a good day,

dpc commented 9 months ago

rayon is generally more popular and powerful and support and does support for unsorted parallel streaming: https://docs.rs/rayon/latest/rayon/iter/trait.ParallelBridge.html#tymethod.par_bridge

The only reason pariter was created was that rayon's model is clunky when you want the preserve the order. Given the extra complexity and a new axis of combinatory explosion, I don't think I'd want to add it.

Yomguithereal commented 9 months ago

Thank you for the answer @dpc. I hadn’t realized that rayon’s parallel bridge was able to work on lazy iterators without requiring to spend too much memory.

Do I understand correctly in your documentation that « handling backpressure » means that for ordered output some tasks may slow everything down (which is alright in my case) rather than leaking memory (which is not alright in my case)? Something rayon is incapable of, or at least clunkily, as you said?

In any case I understand your point and will close this issue. Thank you very much for your help.

dpc commented 9 months ago

Do I understand correctly in your documentation that « handling backpressure » means that for ordered output some tasks may slow everything down (which is alright in my case) rather than leaking memory (which is not alright in my case)? Something rayon is incapable of, or at least clunkily, as you said?

Yes,

https://github.com/dpc/pariter/blob/ea97cc651fd600fa5df6fc02319ad66fbd55ef0a/src/parallel_map.rs#L79

https://github.com/dpc/pariter/blob/ea97cc651fd600fa5df6fc02319ad66fbd55ef0a/src/parallel_map.rs#L200

And item that takes outstandingly long to process comparing to other items will stall the processing to limit the memory use to (roughly) buffer_size pending items. This allows trading memory use for processing time deviation amortization, like in a proper queue system.

pump_tx is used to send the incoming items to the thredpool, only as existing elements were already returned.

Yomguithereal commented 9 months ago

Thank you for the answer @dpc :)