dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

DaskR prototype with rpy2 and reticulate #2254

Open mrocklin opened 6 years ago

mrocklin commented 6 years ago

@quasiben has been playing with Dask and R with reticulate and rpy2, with the objective of providing Dask's concurrent.futures API to R (from which they could presumably build other systems). This is somewhat tricky because you have both a Python and an R session living side by side and need to move things between them from time to time (hopefully infrequently) using reticulate or rpy2. There are a variety of ways to do this, so I thought I'd lay out the way that makes the most sense to me.
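For readers unfamiliar with the API shape being discussed, here is a minimal sketch of the submit/result pattern, written against the standard library's `concurrent.futures` so that it runs without Dask or R installed. Dask's `Client` offers the same `submit()` and `Future.result()` calls. Everything R-related here is an assumption for illustration: `call_r` and `FAKE_R_FUNCTIONS` are hypothetical stand-ins for a call that would really cross into R via rpy2 or reticulate, and are not part of either library.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the R session (assumption, not a real rpy2 API).
# In a real prototype these lookups would cross the Python/R boundary.
FAKE_R_FUNCTIONS = {
    "sum": lambda values: sum(values),
    "length": lambda values: len(values),
}


def call_r(function_name, *args):
    """Pretend to evaluate an R function by name on the given arguments."""
    return FAKE_R_FUNCTIONS[function_name](*args)


def run_demo():
    # submit() returns a future immediately; result() blocks until it is done.
    # With Dask, the executor would be a dask.distributed Client instead.
    with ThreadPoolExecutor(max_workers=2) as executor:
        total = executor.submit(call_r, "sum", [1, 2, 3])
        count = executor.submit(call_r, "length", [1, 2, 3])
        return total.result(), count.result()


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is only that the caller holds futures and asks for results when needed. How R objects are kept on the R side and how rarely they cross the boundary is the design question this issue is about.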

mrocklin commented 6 years ago

For size? https://stat.ethz.ch/R-manual/R-devel/library/utils/html/object.size.html

dhirschfeld commented 6 years ago

Maybe arrow/feather will (one day) provide a better way to interop with R.

It seems a fully-functional arrow implementation is still a ways off though: https://github.com/apache/arrow/pull/2489

mrocklin commented 6 years ago

Arrow would be important here if we wanted to run pandas code on R dataframes or R code on pandas dataframes. That isn't the case here. This issue is restricted to just calling R code on R objects (dataframes and otherwise). I don't think that Arrow is useful in this particular setting.

mrocklin commented 6 years ago

RPy2 has issues with threads. https://bitbucket.org/rpy2/rpy2/issues/449/rpy2-blocks-other-python-threads