JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License
23 stars 9 forks source link

Decision: Use of `asyncmap` in pmap batch mode #43

Open amitmurthy opened 7 years ago

amitmurthy commented 7 years ago

pmap in batch mode uses a local asyncmap to process a batch - https://github.com/JuliaLang/julia/blob/9e3318c9840e7a9e387582ba861408cefe5a4f75/base/distributed/pmap.jl#L198

Considering that each computation in pmap is fairly large, and batch sizes small, an asyncmap would not have a major overhead and if the computation involves IO, quite beneficial.

For example, if the input is a list of file names to be processed, it is efficient to interleave I/O and computation and hence a local asyncmap is a better fit.

This issue is to take a decision whether to 1) Keep it as it is, i.e., no change - the batch is processed using asyncmap

2) Change it to a local map. If the computation involves I/O the caller would have to explicitly partition the input and the mapping function in turn would need to perform an asyncmap and a flatten on the final output.

3) Add another keyword arg to pmap, batch_function=map. To use asyncmap, the caller would need to explicitly specify batch_function=asyncmap. A user defined function (for example one that uses @threads ) can also be specified.

ViralBShah commented 7 years ago

@tanmaykm Was there an example of the asyncmap leading to a performance degradation? If we have a concrete example, that would be good to drive this.

I do like the idea of having a batch execution function inside of pmap, which could use threads.

ViralBShah commented 7 years ago

Another way to do the async case may be to allow an @async pmap(...) syntax for the asynchronous case.

tanmaykm commented 7 years ago

+1 for having the batch_function keyword, with maybe asyncmap as the default.

ViralBShah commented 7 years ago

If there is performance degradation for compute bound jobs, I would rather have map be the default and @async pmap(...) the way to do the async case. It may also be the case that pmap has too much functionality loaded into it.

amitmurthy commented 7 years ago

@async pmap(...) will not work as you intend it to - It will just execute the pmap in a new task.

I am for a new keyword arg, batch_function=map, i.e., with map as the default.

tanmaykm commented 7 years ago

With a correct batchsize I don't think a local asyncmap will have any adverse performance impact.

ViralBShah commented 7 years ago

I was imagining @async special casing on the pmap argument - but that is perhaps too ugly. batch_function is ok.

amitmurthy commented 7 years ago

That was my original thinking. But now I realize it is difficult to predict how it will be used. for example folks are trying with batchsize of 1000.