mrocklin opened this issue 6 years ago (status: Open)
This is a great topic and could greatly benefit me in certain cases, especially if we could avoid the overhead of serializing and copying. I have been generally curious about using the Plasma store from Arrow to share data across process boundaries, but I haven't had a chance to play with it.
To be clear, serialization will always be necessary if you want to move data between processes. However, serialization is also pretty much free for numpy arrays, pandas dataframes, or anything else that is mostly binary data.
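To illustrate why this is cheap, here is a minimal sketch using standard-library pickle protocol 5 (not dask's own serialization path): a large numpy array pickles into a tiny metadata stream plus an out-of-band buffer that refers to the array's existing memory rather than copying it.

```python
import pickle
import numpy as np

arr = np.arange(10_000_000, dtype="float64")  # ~80 MB of binary data

# Protocol 5 passes large buffers "out of band": the pickle stream only
# records metadata, and the raw buffer is handed to buffer_callback
# without being copied.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

print(len(payload))                            # a few hundred bytes of metadata
print(buffers[0].raw().nbytes == arr.nbytes)   # True: the buffer is the array's own memory

# Round-tripping with the same buffers reconstructs the array.
arr2 = pickle.loads(payload, buffers=buffers)
assert (arr2 == arr).all()
```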
The use of POSIX shared memory (the trick that Plasma uses) would probably be the biggest benefit here, if it ends up being worthwhile.
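As a point of reference, the standard library has exposed this mechanism since Python 3.8 via multiprocessing.shared_memory. A minimal sketch (not an existing dask code path) of handing an array's bytes to another process without a second copy:

```python
from multiprocessing import Process, shared_memory
import numpy as np

def consumer(name, shape, dtype):
    # Attach to the existing segment and view it as a numpy array: no copy.
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    print("consumer sees:", view[:5])
    shm.close()

if __name__ == "__main__":
    data = np.arange(1_000_000, dtype="float64")

    # One copy into a shared memory segment on the producing side.
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)[:] = data

    p = Process(target=consumer, args=(shm.name, data.shape, data.dtype))
    p.start()
    p.join()

    shm.close()
    shm.unlink()   # free the segment once everyone is done
```

Note that multiprocessing.shared_memory is cross-platform (POSIX shm on Unix, named shared memory on Windows), which is relevant to the portability question below.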
UNIX domain sockets (https://github.com/dask/distributed/issues/3630) would be one case of this. Though I guess the idea here is to handle any platform?
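For context, a UNIX domain socket skips the TCP/IP stack but data still crosses the kernel, so it is a faster transport rather than a zero-copy one. A bare-bones asyncio sketch of the transport API (the socket path is arbitrary and purely illustrative):

```python
import asyncio
import os

SOCK_PATH = "/tmp/demo-comm.sock"   # illustrative path, Unix-only

async def handle(reader, writer):
    # Echo whatever frame the client sends.
    data = await reader.read(1 << 20)
    writer.write(data)
    await writer.drain()
    writer.close()

async def main():
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    server = await asyncio.start_unix_server(handle, path=SOCK_PATH)
    async with server:
        reader, writer = await asyncio.open_unix_connection(SOCK_PATH)
        writer.write(b"x" * 4096)
        await writer.drain()
        echoed = await reader.read(1 << 20)
        print("round-tripped", len(echoed), "bytes")
        writer.close()
        await writer.wait_closed()

asyncio.run(main())
```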
One option would be to copy frames into multiprocessing.Arrays before transmitting them. This may require some knowledge of where the data is going (https://github.com/dask/distributed/issues/400) in order to benefit from this feature.
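A rough sketch of what that could look like, assuming the shared buffer is passed to the other process at creation time (the names here are illustrative, not anything dask provides today):

```python
import multiprocessing as mp
import numpy as np

def receiver(shared, shape, dtype):
    # Re-wrap the shared buffer as a numpy array in the other process: no copy here.
    frame = np.frombuffer(shared, dtype=dtype).reshape(shape)
    print("receiver sees:", frame[:5])

if __name__ == "__main__":
    frame = np.arange(1_000_000, dtype="float64")   # stand-in for a serialized frame

    # One copy into shared memory on the sending side (lock=False gives a plain buffer).
    shared = mp.Array("d", int(frame.size), lock=False)
    np.frombuffer(shared, dtype=frame.dtype)[:] = frame

    # The receiving process maps the same pages instead of getting a second copy.
    p = mp.Process(target=receiver, args=(shared, frame.shape, frame.dtype))
    p.start()
    p.join()
```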
Currently we have two kinds of comms: intra-process (within a single memory space) and inter-node (between machines, over the network).
There is also a possibility in between, inter-process intra-node: processes communicating with each other on the same machine but in different memory spaces.
Do we expect to see performance improvements from handling this? How expensive would this be to implement?
cc @pitrou in case he has general thoughts