RevolutionAnalytics / RHadoop

RHadoop
https://github.com/RevolutionAnalytics/RHadoop/wiki

1-install deployments #163

Closed piccolbo closed 11 years ago

piccolbo commented 11 years ago

The goal of #66 (0-install) was that a working Hadoop MapReduce cluster plus a local installation of R and rmr would be all that was needed to have a working rmr cluster. While the 0-install branch contains a proof of concept of this, people have questioned both its portability and the risk inherent in installing R on the fly on a cluster without following accepted procedures. While I disagree with both observations, it has been suggested that, as an intermediate step, we split the problem of dynamically installing R from that of dynamically installing all the necessary packages, and target only the latter. The idea is that an R install is a heavy operation needed at most once a year, whereas new packages are being adopted or created all the time by the working data scientist, so a more lightweight process for installing packages is necessary. This is what this issue attempts to solve. 1-install means "you just have to install R"; the rest is automatic.

piccolbo commented 11 years ago

A first implementation of this is in dev. I had to apply a monkey patch to prevent install.packages from corrupting stdout with messages. This problem has been reported on an R mailing list but was dismissed by people with commit power, in case anyone wonders why I don't submit a patch. The strange thing, which I need to document here, is that the data in stdout was the stderr of a command invoked with system. If it had remained on standard error, it would have been perfectly fine, but it ended up on the Rscript stdout, where only data is allowed because of the Hadoop streaming architecture. The only solution I could come up with was to ignore stderr from the system call. Therefore, if an install fails, it will fail without diagnostic messages (not silently, because install.packages will fail too). This needs to improve.
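
To make the trade-off concrete, here is a minimal sketch in Python rather than R. It is not the rmr patch and does not touch install.packages; it only illustrates the general pattern of discarding a child command's stderr so that nothing but data can reach stdout, at the cost of losing diagnostics. The names run_discarding_stderr, noisy_ok and noisy_fail are hypothetical, and the child commands are stand-ins for an install step.

```python
import subprocess
import sys


def run_discarding_stderr(cmd):
    """Run cmd, throw away its stderr, and return (succeeded, captured_stdout).

    Dropping stderr guarantees that diagnostic text cannot leak into a
    stream reserved for data. The price is that a failure is visible only
    through the exit code, with no message explaining what went wrong.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,      # capture stdout so the caller decides what to emit
        stderr=subprocess.DEVNULL,   # diagnostics are discarded, as in the workaround
        text=True,
    )
    return result.returncode == 0, result.stdout


# Hypothetical stand-ins for an install command: both write only to stderr.
noisy_ok = [sys.executable, "-c",
            "import sys; print('compiling', file=sys.stderr)"]
noisy_fail = [sys.executable, "-c",
              "import sys; print('build error', file=sys.stderr); sys.exit(1)"]

ok, ok_out = run_discarding_stderr(noisy_ok)
failed_ok, fail_out = run_discarding_stderr(noisy_fail)
```

In this sketch the failing command is still detected through its exit code, but the "build error" text is gone, which mirrors the "fails without diagnostic messages" behaviour described above.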

piccolbo commented 11 years ago

After much work, it turned out to be impossible to reliably suppress installation output. The side effect of leaving the cluster in an inconsistent state was also considered a disadvantage. Hence, this approach has been abandoned.