Open amitmurthy opened 7 years ago
This would solve major connection time issues on large clusters that we have repeatedly seen.
Just wanted to mention that it also seemed that https://github.com/JuliaLang/julia/pull/22588 made adding remote workers noticeably faster.
I wonder how and why JuliaLang/julia#22588 affected worker startup time. @vtjnash ?
@andreasnoack / @ViralBShah care to comment on the interface for lazy connection setup in JuliaLang/julia#22814?
Sorry for the noise here. Just did some more systematic timings and my previous impression must have been based on differences in the connection.
Bump – are we still planning on doing this?
bump
The default `all_to_all` topology connects all processes to each other. While this is fine for small clusters, the total number of TCP connections grows quadratically, as N(N-1)/2 (roughly (N^2)/2) for N processes. Considering that a large class of parallel problems only needs master-worker connections, we should change the default topology to `all_to_all_lazy`, where worker-worker connections are set up only on the first request from one worker to another. We should also introduce another topology, `master_routed`, which only connects the master to the workers and, in case of a worker-worker call, routes the request through the master.

To summarize, implement 2 new topologies:
1) `all_to_all_lazy`, where worker-worker connections are set up lazily, and which is the default for addprocs, and
2) `master_routed`, in which only the master connects to workers and worker-worker messages are routed via the master.
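
To make the connection counts concrete, here is a rough sketch in Python. It is a toy model, not Julia's Distributed implementation: the function and class names (`all_to_all_connections`, `master_routed_connections`, `LazyAllToAll`) are made up for illustration, and it assumes the master is process 1 and the workers are 2..N.

```python
# Toy model only: this is NOT how Julia's Distributed code is implemented.
# All names here are hypothetical. Process ids assume master = 1, workers = 2..n.


def all_to_all_connections(n_procs: int) -> int:
    """Connections when every process connects to every other one: N(N-1)/2."""
    return n_procs * (n_procs - 1) // 2


def master_routed_connections(n_procs: int) -> int:
    """Connections when only the master connects to each worker: N-1."""
    return max(n_procs - 1, 0)


class LazyAllToAll:
    """Master-worker links exist up front; worker-worker links open on first use."""

    def __init__(self, n_procs: int) -> None:
        self.n_procs = n_procs
        # Eagerly connect the master (id 1) to every worker (ids 2..n_procs).
        self.links = {frozenset((1, w)) for w in range(2, n_procs + 1)}

    def request(self, src: int, dst: int) -> bool:
        """Record a request from src to dst; return True if a new link was opened."""
        valid = range(1, self.n_procs + 1)
        if src == dst or src not in valid or dst not in valid:
            raise ValueError("src and dst must be distinct, valid process ids")
        link = frozenset((src, dst))
        if link in self.links:
            return False  # already connected, nothing to set up
        self.links.add(link)  # first worker-worker request pays the setup cost
        return True


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, all_to_all_connections(n), master_routed_connections(n))

    cluster = LazyAllToAll(4)
    print(len(cluster.links))      # master-worker links only
    print(cluster.request(2, 3))   # first request between workers 2 and 3
    print(cluster.request(3, 2))   # same pair again, link is reused
```

With 1000 processes the eager topology needs 499500 connections, while master-only needs 999. In the lazy model a cluster starts at the master-only count and only grows toward the all-to-all count for the worker pairs that actually talk to each other.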