Open nknauer opened 3 years ago
What is taking most of the time? My experience is that installing rstan and installing dplyr both take a very long time because there is a lot of C++ code to compile, and a lot of dependencies. The actual installation of prophet should take about 1-2 minutes.
The dependency packages like V8 and rstantools are not the bottleneck, unfortunately. Instead, it is the prophet package itself.
However, the prophet package has its own dependencies that it needs in order to run, such as:
checkmate, matrixStats, zoo, inline, loo, dygraphs, extraDistr, RcppParallel, rstan, StanHeaders, xts, RcppEigen
Is there a better way to install prophet within Databricks? At this point it seems unavoidable that it will take 20-25 minutes each time I run my notebook.
Here is a screenshot of all the load times by line item in Databricks:
Yeah the issue is in all of those dependencies, which unfortunately are actually needed and so there isn't a workaround to skip some of them. I think this issue goes a bit beyond my understanding of both R packaging and databricks, so hopefully someone else will be able to chime in, but ultimately I think you'd need to somehow have the dependencies included in the image so that they don't have to be installed from scratch.
One thing you can do in Databricks is use the renv package, and set your global package cache to a dbfs location. That way when you save off and then restore your environment, it will just copy it from the global package cache instead of downloading all of the dependencies every time. This is much, much faster. Unfortunately, I can't get prophet to work in R on Databricks (Enterprise) because I have no way of installing the V8 dependency (can't reach outside of the private network), but luckily the Python version does come pre-installed on the ML runtimes.
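For anyone trying this, here is a minimal sketch of the renv approach described above. It assumes DBFS is mounted at /dbfs on the driver. The cache path, project directory, and lockfile location are hypothetical names chosen for illustration, not values from this thread. The first call still pays the full compile time; the point is that the built packages land in the shared cache for later restores.

```r
# Sketch only: all three paths are assumptions, adjust them to your workspace.
cache_dir   <- "/dbfs/renv/cache"              # hypothetical shared renv cache on DBFS
project_dir <- "/tmp/prophet-project"          # hypothetical project directory on the driver
lockfile    <- "/dbfs/renv/prophet/renv.lock"  # hypothetical place to keep the lockfile

# Point renv at the shared cache before calling any other renv function.
Sys.setenv(RENV_PATHS_CACHE = cache_dir)

build_prophet_env <- function() {
  dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)
  dir.create(dirname(lockfile), recursive = TRUE, showWarnings = FALSE)

  # bare = TRUE starts from an empty project library;
  # restart = FALSE avoids restarting the R session.
  renv::init(project = project_dir, bare = TRUE, restart = FALSE)

  # Slow the first time: prophet and its dependencies are built and cached.
  renv::install("prophet", project = project_dir)

  # type = "all" records every package in the project library in the lockfile.
  renv::snapshot(project = project_dir, lockfile = lockfile,
                 type = "all", prompt = FALSE)
}
```

Calling build_prophet_env() once (with renv already installed) should leave a lockfile on DBFS that later sessions can restore from.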
Very interesting, thanks! Never tried something like this before. I will attempt this and get back to you with results. Could this method also work with installing it directly to a cluster?
I don't think it would work for a cluster scoped package (since that functionality just installs using install.packages from CRAN). Also, the idea of renv is to separate environments based on the project you are working on, so it could be argued you don't want one environment installed on the entire cluster. But if you set your global package cache environment variable in an init script, and install renv as a cluster scoped package, it will be very fast to restore environments once your cluster is spun up.
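A rough sketch of what the restore step could look like in a notebook on a freshly started cluster. It assumes the init script has made RENV_PATHS_CACHE visible to the R session, that renv is installed as a cluster scoped package, and that a lockfile was saved earlier. The function name and both default paths are hypothetical.

```r
# Sketch only: the default paths are assumptions, not values from this thread.
restore_prophet_env <- function(lockfile = "/dbfs/renv/prophet/renv.lock",
                                project_dir = "/tmp/prophet-project") {
  # Fail early if the init script did not export the cache location.
  if (!nzchar(Sys.getenv("RENV_PATHS_CACHE"))) {
    stop("RENV_PATHS_CACHE is not set; check the cluster init script.")
  }

  dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)

  # Start from an empty project library without restarting the R session.
  renv::init(project = project_dir, bare = TRUE, restart = FALSE)

  # Packages already in the cache are reused instead of being rebuilt.
  renv::restore(project = project_dir, lockfile = lockfile, prompt = FALSE)
}
```

After restore_prophet_env() returns, library(prophet) should load from the restored project library. Whether the variable set in an init script actually reaches the R process depends on how the script exports it, so it is worth checking Sys.getenv("RENV_PATHS_CACHE") first.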
Thanks, will try out the init script. I also asked on Stack Overflow: https://stackoverflow.com/questions/69884057/installing-remote-r-package-to-databricks-cluster-rather-than-notebook
Hi, I am trying to install the prophet R package in a Databricks notebook.
Is there a faster way to do this than below? It takes 25min to install the prophet package this way.