futureverse / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

support Sparklyr in future mode #286

Open harryprince opened 5 years ago

harryprince commented 5 years ago

Hi future team, I find future to be a great framework for distributed data processing. sparklyr::spark_apply does something similar, and it supports local, yarn-client, and yarn-cluster modes.

I would like to see Spark integrated into the future framework.

HenrikBengtsson commented 5 years ago

Hi, this is an interesting idea.

It's been a while since I dove into the inner parts of sparklyr, but if it's now possible to "launch" a single expression on Spark, check whether it's done (in a non-blocking way), and then collect the results, "all that is needed" is to implement future(), resolved(), and value() on top of sparklyr and we're home. Then, with the future.tests validator we can make sure it conforms to the core Future API. Then, it'll work everywhere.
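
For illustration only, here is a minimal sketch of the launch / poll / collect cycle described above, using the multisession backend that ships with future rather than Spark. It assumes the future package is installed and that two local R workers can be started. A Spark backend would have to provide these same three steps, which this sketch does not attempt.

```r
library(future)

# Assumption: two local background R sessions stand in for a remote backend.
plan(multisession, workers = 2)

# 1. Launch a single expression without blocking the main session.
f <- future({
  Sys.sleep(1)
  sum(1:10)
})

# 2. Check whether it is done, in a non-blocking way.
while (!resolved(f)) {
  Sys.sleep(0.1)  # the main session is free to do other work here
}

# 3. Collect the result.
v <- value(f)
print(v)

# Shut down the background workers.
plan(sequential)
```

The polling loop is only there to show resolved(). Calling value(f) directly would block until the result is available.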

EDIT 2020-01-07: Added an important but missing "if" above.

harryprince commented 5 years ago

I think you could cooperate with the RStudio team. I will open another issue on the sparklyr repo.

smingerson commented 2 years ago

In sparklyr 1.7.7, there is registerDoSpark to register Spark as a parallel backend for foreach. I was wondering if between that and doFuture there was a path forward for using Spark as a future backend? Perhaps related, I see that in sparklyr::spark_apply() there appears to be support for barrier execution, which is mentioned in the linked sparklyr issue.