dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.58k stars 719 forks source link

Intra-host communication #2046

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

Currently we have two kinds of comms:

  1. inproc:// for intra-process communication with queues
  2. tcp:// and tls:// for inter-process or inter-node communication with sockets

There is also a possibility in between for inter-process intra-node, that is for processes communicating to each other on the same machine but in different memory spaces.

Do we expect to see performance improvements from handling this? How expensive would this be to implement?

cc @pitrou in case he has general thoughts

adamklein commented 6 years ago

This is a great topic and could greatly benefit me in certain cases, especially if we could avoid overhead of serializing and copying. I have been generally curious about using the plasma store from arrow to share data across process boundaries but haven't had a chance to play with it.

mrocklin commented 6 years ago

To be clear, serialization will always be necessary if you want to move between processes. However, serialization is also pretty much free in the case of numpy arrays, pandas dataframes, or anything else that is mostly binary data.

The use of posix shared memory (the trick that Plasma uses) would probably be the biggest benefit here if it ends up being worthwhile.

jakirkham commented 4 years ago

UNIX domain sockets ( https://github.com/dask/distributed/issues/3630 ) would be one case of this. Though I guess the idea here is to handle any platform?

One option would be to copy frames into multiprocessing.Arrays before transmitting them. This may require some knowledge of where the data is going ( https://github.com/dask/distributed/issues/400 ) in order to benefit from this feature.