facebook / prophet

Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
https://facebook.github.io/prophet
MIT License
18.51k stars 4.53k forks

Installing Prophet R-Package in Databricks - Is there a faster way? #1857

Open nknauer opened 3 years ago

nknauer commented 3 years ago

Hi, I am trying to install the prophet R package in a Databricks notebook.

Is there a faster way to do this than the approach below? Installing the prophet package this way takes 25 minutes.

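# Have the V8 package download a static libv8 build during installation
Sys.setenv(DOWNLOAD_STATIC_LIBV8 = 1)
remotes::install_github("jeroen/V8")
# Pin rstantools to version 2.0.0
devtools::install_version("rstantools", version = "2.0.0")
install.packages('prophet')

bletham commented 3 years ago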

What is taking most of the time? My experience is that installing rstan and installing dplyr both take a very long time because there is a lot of C++ code to compile, and a lot of dependencies. The actual installation of prophet should take about 1-2 minutes.
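One way to answer that question is to time each install step on its own. The following is a minimal sketch, not something from this thread: `time_installs` is a hypothetical helper name, and the `installer` argument exists only so the helper can be tried without hitting the network.

```r
# Time each install step separately to see where the minutes go.
# `installer` defaults to install.packages; it is a parameter only so the
# helper can be exercised with a stub instead of a real install.
time_installs <- function(pkgs, installer = utils::install.packages) {
  elapsed <- vapply(
    pkgs,
    function(pkg) system.time(installer(pkg))[["elapsed"]],
    numeric(1)
  )
  # Slowest step first, named by package
  sort(elapsed, decreasing = TRUE)
}

# Example call (commented out because it performs real installs):
# time_installs(c("rstan", "dplyr", "prophet"))
```

Note that each timing includes any dependencies that are not installed yet, so the order of the packages affects which step appears slow.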

nknauer commented 3 years ago

The dependency packages like V8 and rstantools are not the bottleneck, unfortunately. Instead, it is the prophet package itself. However, installing prophet pulls in the dependencies it needs to run, such as: checkmate, matrixStats, zoo, inline, loo, dygraphs, extraDistr, RcppParallel, rstan, StanHeaders, xts, RcppEigen

Is there a better way to install prophet within Databricks? At this point it seems unavoidable that it will take 20-25 minutes each time I run my notebook.

Here are screenshots of all the load times by line item in Databricks: [3 images]

bletham commented 3 years ago

Yeah, the issue is in all of those dependencies, which unfortunately are actually needed, so there isn't a workaround to skip some of them. I think this issue goes a bit beyond my understanding of both R packaging and Databricks, so hopefully someone else will be able to chime in, but ultimately I think you'd need to somehow have the dependencies included in the image so that they don't have to be installed from scratch.

dwh1142 commented 3 years ago

One thing you can do in Databricks is use the renv package and set your global package cache to a DBFS location. That way, when you save and later restore your environment, it will just copy packages from the global package cache instead of downloading all of the dependencies every time. This is much, much faster. Unfortunately, I can't get prophet to work in R on Databricks (Enterprise) because I have no way of installing the V8 dependency (I can't reach outside of the private network), but luckily the Python version does come pre-installed on the ML runtimes.
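A rough sketch of that setup, under some assumptions: renv reads its global cache location from the `RENV_PATHS_CACHE` environment variable, while the helper names and the `/dbfs/renv/cache` path below are hypothetical placeholders rather than anything confirmed in this thread.

```r
# Point renv's global package cache at a shared directory.
# Best done before any renv function is called in the session.
configure_renv_cache <- function(cache_dir) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  Sys.setenv(RENV_PATHS_CACHE = cache_dir)
  invisible(cache_dir)
}

# First run only: install once and record the environment in renv.lock.
# Defined but not called here, since it needs renv and network access.
build_and_snapshot <- function(project_dir, cache_dir, pkgs = "prophet") {
  configure_renv_cache(cache_dir)
  renv::init(project = project_dir, bare = TRUE)
  renv::install(pkgs)
  # type = "all" records every package in the project library
  renv::snapshot(project = project_dir, type = "all", prompt = FALSE)
}

# Example call (hypothetical paths):
# build_and_snapshot("/dbfs/projects/forecast", "/dbfs/renv/cache")
```

The slow compile still happens on that first run; the saving only shows up on later restores that find the built packages in the cache.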

nknauer commented 3 years ago

Very interesting, thanks! Never tried something like this before. I will attempt this and get back to you with results. Could this method also work with installing it directly to a cluster?

dwh1142 commented 3 years ago

I don't think it would work for a cluster-scoped package (since that functionality just installs using install.packages from CRAN). Also, the idea of renv is to separate environments based on the project you are working on, so it could be argued that you don't want one environment installed on the entire cluster. But if you set your global package cache environment variable in an init script, and install renv as a cluster-scoped package, it will be very fast to restore environments once your cluster is spun up.
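On the notebook side, the restore step could look roughly like the sketch below. It assumes the init script has already exported `RENV_PATHS_CACHE` to the R session and that the project directory contains an `renv.lock`; `cache_is_configured` and `restore_project` are hypothetical helper names.

```r
# TRUE when the cache variable is set and points at an existing directory.
cache_is_configured <- function() {
  cache <- Sys.getenv("RENV_PATHS_CACHE", unset = "")
  nzchar(cache) && dir.exists(cache)
}

# Restore the project library from renv.lock, refusing to run without a
# cache so a missing init script does not silently trigger a full rebuild.
restore_project <- function(project_dir = getwd()) {
  if (!cache_is_configured()) {
    stop("RENV_PATHS_CACHE is not set or does not exist; check the init script.")
  }
  renv::restore(project = project_dir, prompt = FALSE)
}

# Example call (hypothetical path):
# restore_project("/dbfs/projects/forecast")
```

Failing early here is a design choice: without the check, a misconfigured cluster would fall back to the 20-25 minute install with no hint as to why.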

nknauer commented 3 years ago

Thanks, will try out the init script. I also asked on Stack Overflow: https://stackoverflow.com/questions/69884057/installing-remote-r-package-to-databricks-cluster-rather-than-notebook