[Open] KnutJaegersberg opened 2 years ago

Regular user of disk.frame here. I love that package's map function, which allows applying custom R functions to a larger-than-RAM dataset in parallel over the batches by using the future package. Am I right that map_batches right now supports single-core processing, as it is implemented as a while loop over the batches? Do you plan to bring this very handy capability to R arrow, too? Sure, it must be slower than Arrow's C++-based functionality, but it is still super useful.
> Am I right that map_batches right now supports single-core processing, as it is implemented as a while loop over the batches?
Yes, right now it only uses a single core. Other operations implemented in C++ (like the dplyr ones) are multi-threaded, but I don't think we have a way to run arbitrary R code in multiple threads.
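For reference, a minimal sketch of the current sequential usage (the dataset path and the per-batch function body here are made up):

```r
library(arrow)

ds <- open_dataset("path/to/dataset")  # hypothetical dataset

# map_batches() pulls one RecordBatch at a time off the scan and calls
# the R function on it before moving to the next -- a sequential loop.
res <- map_batches(ds, function(batch) {
  # arbitrary R code runs here, once per batch, on a single core
  as.data.frame(batch)
})
```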
> Do you plan to bring this very handy capability to R arrow, too?
Maybe. From the future package's documentation:
> This package implements sequential, multicore, multisession, and cluster futures. With these, R expressions can be evaluated on the local machine, in parallel on a set of local machines, or distributed on a mix of local and remote machines.
I'm not sure if we would target non-local execution, but we might target multisession if we had a good way of sharing Arrow RecordBatches with subprocesses.
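A hedged sketch of what that sharing could look like with the multisession backend: serialize the RecordBatch to the Arrow IPC stream format (a plain raw vector, which base R can ship to a worker) and rehydrate it on the other side. write_to_raw() and read_ipc_stream() are existing arrow functions; the rest is illustrative:

```r
library(arrow)
library(future)

plan(multisession, workers = 2)

batch <- record_batch(x = 1:10, y = rnorm(10))

# External pointers don't survive base R serialization, so send the
# batch as an Arrow IPC stream buffer instead of the R6 object itself.
buf <- write_to_raw(batch, format = "stream")

f <- future({
  tbl <- arrow::read_ipc_stream(buf, as_data_frame = FALSE)
  sum(as.vector(tbl$x))  # any R-level work on the batch goes here
})
value(f)
```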
I might research this a bit. @paleolimbot, do you have any thoughts on this?
There's a good discussion on the mailing list about this in the context of Python user-defined functions, which were just added (https://lists.apache.org/thread/hwcrnf77j8p4dvyzoc3v5cwgws83nvqp). In the next few months we'll have R user-defined functions too, which will serve a related purpose (do things with the Arrow compute engine that aren't implemented in C++). The gist of the mailing list discussion is that a lot of the parallelism is already handled by Arrow, and when R/Python code dispatches more workers it might not result in the performance gain that one might expect.
For map_batches() it's a little easier to think of a use case where dispatching incoming RecordBatches to workers running on other cores would help (my mind jumps to performing a spatial join, which won't be in the query engine anytime soon). I think we have almost all the infrastructure we need to do this... we can read/write from R connections (like socketConnection()), which gives us some options for interprocess communication if they're not already present/too slow in the future package abstraction.
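To illustrate the connection-based route, a sketch split across two separate R processes; it leans on the connection support mentioned above (so it assumes the IPC stream reader/writer accept a connection object), and the port number is arbitrary:

```r
library(arrow)

## Process A: serve one batch over a local socket
server <- socketConnection(port = 6011, server = TRUE, blocking = TRUE, open = "wb")
write_ipc_stream(record_batch(x = 1:5), server)
close(server)

## Process B: connect and read the batch back
client <- socketConnection("localhost", port = 6011, blocking = TRUE, open = "rb")
tbl <- read_ipc_stream(client, as_data_frame = FALSE)
close(client)
```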
I had some fun trying to wire all of that up and ran into some problems, which expose some of the limits of our current infrastructure. Perhaps a good scope for what we should support in Arrow is the tools to make this work (while the actual implementation could/should live in an extension package?)
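To make the extension-package idea concrete, a rough sketch of what such a helper might look like (map_batches_parallel() is a hypothetical name; result ordering, error handling, and backpressure are all glossed over):

```r
library(arrow)
library(future)

plan(multisession, workers = 4)

# Hypothetical parallel map_batches(): stream batches off the scanner,
# ship each one to a worker as an IPC buffer, collect results at the end.
map_batches_parallel <- function(dataset, fun) {
  reader <- Scanner$create(dataset)$ToRecordBatchReader()
  futures <- list()
  while (!is.null(batch <- reader$read_next_batch())) {
    buf <- write_to_raw(batch, format = "stream")
    futures[[length(futures) + 1L]] <- future({
      fun(arrow::read_ipc_stream(buf, as_data_frame = FALSE))
    })
  }
  lapply(futures, value)
}
```

Each future captures its own copy of buf, so several batches can be in flight on workers while the main process keeps scanning.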