Open javierluraschi opened 5 years ago
Sounds really interesting. Unfortunately, I don't have any CDH cluster for evaluation since I left Cloudera. Technically, it's easy to build a parcel with Arrow, but I don't know how difficult evaluation will be after the major update to CDH 6.
It'd be very helpful if you or someone could support evaluation.
The preparation of the R environment for a parcel is based on conda. I've started an initial investigation in a local environment and confirmed that arrow and sparklyr work fine with the miniconda Docker image.
https://gist.github.com/chezou/756775624b3272b8b5db1711d9090e88
The next step is to replace the conda preparation with a YAML-based one; then I will build the parcel with Travis CI.
Note: We need to set `TAR=/bin/tar` as an environment variable for the installation of arrow.
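For the `TAR` note above, a minimal shell sketch of what this could look like. The helper name `install_r_arrow` and the CRAN mirror URL are assumptions for illustration; they are not taken from the parcel build scripts.

```shell
# Point R at the system tar before installing, as described in the note.
export TAR=/bin/tar

# Hypothetical helper (name and mirror URL are assumptions):
# installs the R arrow package with TAR already exported.
install_r_arrow() {
  Rscript -e 'install.packages("arrow", repos = "https://cloud.r-project.org")'
}
```

Calling `install_r_arrow` afterwards runs the installation in an environment where `TAR` is set; the actual conda-based build may set the variable somewhere else.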
@javierluraschi I'd like to confirm the meaning of "support for Apache Arrow". Do you intend to include a specific version of the `arrow` package for R, or just the Apache Arrow dependency? It could work out of the box if we included the R `arrow` package, but that might cause version conflicts between the edge node and the cluster workers.
I got support for an evaluation cluster. I've started working on a parcel that includes the R arrow package on this branch https://github.com/chezou/cloudera-parcel/tree/apache-arrow
Released parcels on GitHub https://github.com/chezou/cloudera-parcel/releases/tag/3.5.1.p0.0.1 . Will test on CDH cluster.
Work to support Apache Arrow in `sparklyr` is on its way, see https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/. Therefore, it would be ideal to support Arrow in this parcel as well with something like:
Ideally, using the following mirror that will be updated only to the latest supported version of `sparklyr`: