delta-rho / docs-RHIPE

Tutorial and function reference for trelliscope

Installation of additional R packages needs clarification #5

Open lhsego opened 8 years ago

lhsego commented 8 years ago

In trying to install packages to HDFS via RHIPE, I'm finding the following documentation confusing: http://tessera.io/docs-RHIPE/#install-and-push (and its companion documentation for the cluster has the same problem).

Specifically, where should I store the package sources (Rpackage_version.tar.gz) on HDFS and how do I push them to HDFS?

lhsego commented 8 years ago

I've figured it out now. But reading these instructions for the first time, I found them a bit difficult to follow. I'd recommend that after this line:

bashRhipeArchive() creates the actual archive of your installations and names it as R.Pkg

you add the following to clarify:

Specifically, any packages that you have previously installed on the R session server will automatically be pushed up to HDFS and will then be available for use by Rhipe.
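For readers following along, the install-and-push step discussed here can be sketched roughly as below. This is only a sketch under assumptions: `push_r_archive`, `hdfs_dir`, and the example path are hypothetical names introduced for illustration, and the archive file name `<archive_name>.tar.gz` is assumed from the linked documentation rather than confirmed in this thread. Calling the function requires a working Rhipe and Hadoop setup.

```r
# Sketch only: defining this function is harmless, but calling it needs a
# working Rhipe/Hadoop installation. All argument values are placeholders.
push_r_archive <- function(hdfs_dir, archive_name = "R.Pkg") {
  Rhipe::rhinit()                        # start the Rhipe session
  Rhipe::rhmkdir(hdfs_dir)               # create the target directory on HDFS
  Rhipe::hdfs.setwd(hdfs_dir)            # make it the HDFS working directory
  Rhipe::bashRhipeArchive(archive_name)  # archive the local R installation and push it
  # Assumed location of the resulting archive on HDFS (not confirmed in this thread).
  invisible(file.path(hdfs_dir, paste0(archive_name, ".tar.gz")))
}

# Hypothetical usage: install packages locally first, then push the archive.
# install.packages("somePackage")
# push_r_archive("/user/loginname/bin")
```

The point of the sketch is the ordering: packages are installed in the local R session first, and the archive call then carries them to HDFS, so no package sources need to be copied to HDFS by hand.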

lhsego commented 8 years ago

One last comment. This phrase isn't clear:

You do not need them again until you reinstall.

Reinstall what? It makes it sound like re-installation of something is part of the process. Is it?

saptarshiguha commented 8 years ago

Sorry for the late reply. I didn't write this, but IIRC jeremiah did.

I have been using this function instead; see https://gist.github.com/saptarshiguha/1f8f03b55bb171959b66

It does (I assume) what bashRhipeArchive() does.

After that I initialize Rhipe like this:

    library(Rhipe)
    rhinit()

    ## Using the environment variable R_DISTRIBUTION I can choose which R distribution I want to use.
    RDIST <- if (Sys.getenv("R_DISTRIBUTION") == "") "R31b_74" else Sys.getenv("R_DISTRIBUTION")
    ## USER is not defined in the snippet as posted; reading it from the environment is an assumption.
    USER <- Sys.getenv("USER")

    ## Point the MapReduce environment at the R distribution unpacked from the archive.
    m <- rhoptions()$mropts
    m$R_HOME          <- sprintf("%s/R", RDIST)
    m$R_HOME_DIR      <- sprintf("./%s/R", RDIST)
    m$R_SHARE_DIR     <- sprintf("./%s/R/share", RDIST)
    m$R_INCLUDE_DIR   <- sprintf("./%s/R/include", RDIST)
    m$R_DOC_DIR       <- sprintf("./%s/R/doc", RDIST)
    m$PATH            <- sprintf("./%s/R/bin:./%s/:$PATH", RDIST, RDIST)
    m$LD_LIBRARY_PATH <- sprintf("./%s/:./%s/R/lib:/usr/lib64", RDIST, RDIST)

    rhoptions(runner = sprintf("./%s/RhipeMapReduce --silent --vanilla", RDIST),
              zips = c(sprintf("/user/sguha/%s.tar.gz", RDIST)),
              HADOOP.TMP.FOLDER = sprintf("/user/%s/tmp/", USER),
              mropts = m,
              job.status.overprint = TRUE,
              write.job.info = TRUE)
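To see what those settings expand to without touching a cluster, the path construction can be pulled into a small pure function. This is a sketch, not part of Rhipe: `rdist_env` is a hypothetical helper name, and it only reproduces the sprintf() patterns from the snippet above.

```r
# Hypothetical helper: builds the same per-distribution settings as a named list.
# It uses base R only, so it runs without Rhipe or Hadoop.
rdist_env <- function(rdist) {
  list(
    R_HOME          = sprintf("%s/R", rdist),
    R_HOME_DIR      = sprintf("./%s/R", rdist),
    R_SHARE_DIR     = sprintf("./%s/R/share", rdist),
    R_INCLUDE_DIR   = sprintf("./%s/R/include", rdist),
    R_DOC_DIR       = sprintf("./%s/R/doc", rdist),
    PATH            = sprintf("./%s/R/bin:./%s/:$PATH", rdist, rdist),
    LD_LIBRARY_PATH = sprintf("./%s/:./%s/R/lib:/usr/lib64", rdist, rdist)
  )
}

env <- rdist_env("R31b_74")
print(env$PATH)  # "./R31b_74/R/bin:./R31b_74/:$PATH"
```

All paths are relative to the directory where the archive is unpacked, which is why switching distributions only means changing the one name.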

Note that every time you install a new R package that you need during the MR job, you have to rebuild this R archive.

Hope this helps.

Cheers,
Saptarshi
