chezou / cloudera-parcel

customized cloudera-parcel

Support for Apache Arrow #3

Open javierluraschi opened 5 years ago

javierluraschi commented 5 years ago

Work to support Apache Arrow in sparklyr is under way; see https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/ for details.

Therefore, it would be ideal to support Arrow in this parcel as well with something like:

sudo yum install -y https://packages.red-data-tools.org/centos/red-data-tools-release-latest.noarch.rpm
sudo sed -i 's/\$releasever/6/g' /etc/yum.repos.d/red-data-tools.repo
sudo yum install -y --enablerepo=red-data-tools arrow-devel

Ideally, this would use the following mirror, which will only be updated to the latest version supported by sparklyr:

sudo yum install -y https://arrowlib.rstudio.com/centos/red-data-tools-release-latest.noarch.rpm
sudo sed -i 's/\$releasever/6/g' /etc/yum.repos.d/red-data-tools.repo
sudo yum install -y --enablerepo=red-data-tools arrow-devel

chezou commented 5 years ago

Sounds really interesting. Unfortunately, I haven't had a CDH cluster for evaluation since I left Cloudera. Technically, it's easy to build a parcel with Arrow, but I don't know how difficult the evaluation will be after the major update to CDH 6.

It'd be very helpful if you or someone else could help with the evaluation.

chezou commented 5 years ago

The R environment for the parcel is prepared with conda. I've started an initial investigation in a local environment and confirmed that arrow and sparklyr work fine in a miniconda Docker image.

https://gist.github.com/chezou/756775624b3272b8b5db1711d9090e88

The next step is to replace the conda preparation with a YAML-based one; then I will build the parcel with Travis CI.
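
A YAML-based conda setup could look roughly like the sketch below. The environment name `r-arrow-parcel`, the file name `environment.yml`, the conda-forge channel, the package list, and the `create_env` helper are all assumptions for illustration, not the contents of the actual parcel build.

```shell
# Sketch only: environment name, channel, and package list are assumptions.
ENV_FILE="environment.yml"

# Write a declarative environment definition instead of ad hoc conda commands.
cat > "$ENV_FILE" <<'EOF'
name: r-arrow-parcel
channels:
  - conda-forge
dependencies:
  - r-base
  - r-sparklyr
  - r-arrow
EOF

# Hypothetical helper: builds the environment from the YAML file.
# Call it on a machine where conda is installed.
create_env() {
  conda env create -f "$ENV_FILE"
}
```

Running `create_env` afterwards would build the environment. Keeping the definition in one file makes it easier to reproduce the same environment in CI.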

Note: We need to set TAR=/bin/tar as an environment variable when installing arrow.
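
As a minimal sketch, the variable can be exported in the shell before the install runs. The `install_arrow` helper and the CRAN mirror URL are assumptions; the thread does not say how the arrow R package is actually installed in the parcel build.

```shell
# R's package tooling reads the TAR environment variable to locate tar.
export TAR=/bin/tar

# Warn early if the path does not exist on this machine.
if [ ! -x "$TAR" ]; then
  echo "warning: $TAR is not executable; adjust the path" >&2
fi

# Hypothetical install step; the real build may install arrow differently.
install_arrow() {
  Rscript -e 'install.packages("arrow", repos = "https://cloud.r-project.org")'
}
```

Because the variable is exported, any `Rscript` process started from the same shell, including the one in `install_arrow`, inherits it.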

chezou commented 5 years ago

@javierluraschi I'd like to confirm what "support for Apache Arrow" means. Do you intend to include a specific version of the arrow package for R, or just the Apache Arrow dependency? Including the R arrow package would make it work out of the box, but it might cause version conflicts between the edge node and the cluster workers.

chezou commented 5 years ago

I now have access to a cluster for evaluation. I've started working on a parcel that includes the R arrow package on this branch: https://github.com/chezou/cloudera-parcel/tree/apache-arrow

chezou commented 5 years ago

Released the parcels on GitHub (https://github.com/chezou/cloudera-parcel/releases/tag/3.5.1.p0.0.1). I will test them on a CDH cluster.