facebookexperimental / Robyn

Robyn is an experimental, AI/ML-powered and open sourced Marketing Mix Modeling (MMM) package from Meta Marketing Science. Our mission is to democratise modeling knowledge, inspire the industry through innovation, reduce human bias in the modeling process & build a strong open source marketing science community.
https://facebookexperimental.github.io/Robyn/
MIT License

Kernel restarting in Vertex AI JupyterLab #710

Closed lukmaz closed 1 year ago

lukmaz commented 1 year ago

Project Robyn

Describe issue

I am trying to run the demo script in Jupyterlab on a n1-standard-4 machine in Vertex AI (4 vCPUs, 15 GB RAM). The script crashes in the robyn_run stage, causing a kernel restart.

I suspect it crashes because the default number of threads exceeds the available memory (with cores = NULL it runs 4 - 1 = 3 threads). When I set cores = 1 in the robyn_run arguments, it no longer crashes in robyn_run, so the cause appears to be a low memory-to-cores ratio on the machine.
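For reference, a minimal sketch of the worker-count arithmetic described above; the robyn_run() call is abbreviated and its remaining arguments are assumed to follow demo.R:

```r
# Robyn's default cores = NULL uses all available cores minus one, so on an
# n1-standard-4 (4 vCPUs) it runs 4 - 1 = 3 parallel workers:
parallel::detectCores() - 1  # 3 on this machine

# Capping the workers avoids the memory blow-up in robyn_run():
# OutputModels <- robyn_run(InputCollect = InputCollect, ..., cores = 1)
```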

The problem I cannot work around is that the script crashes in a similar way in the robyn_refresh stage, possibly in the plotting code, since the last message logged to the output before the crash is: Plotting 4 selected models on 3 cores.... For this issue I also tried a higher-memory Vertex AI machine (n2-highmem-16, 16 vCPUs, 128 GB RAM) and the problem still persists. I didn't find an option to reduce the number of cores used in robyn_refresh.

Is it possible to limit the number of cores used in robyn_refresh, similarly as it is possible in robyn_run?

Provide reproducible example

Run demo.R on n1-standard-4 or n2-highmem-16 machine in Vertex AI.

Environment & Robyn version

Make sure you're using the latest Robyn version before you post an issue.

```
Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] doRNG_1.8.6 rngtools_1.5.2 foreach_1.5.2 Robyn_3.10.3.9000

loaded via a namespace (and not attached): [1] nlme_3.1-162 bitops_1.0-7 matrixStats_0.63.0
[4] lubridate_1.9.2 doParallel_1.0.17 RColorBrewer_1.1-3
[7] httr_1.4.5 rprojroot_2.0.3 rstan_2.21.8
[10] repr_1.1.6 tools_4.2.3 utf8_1.2.3
[13] R6_2.5.1 rpart_4.1.19 mgcv_1.8-42
[16] colorspace_2.1-0 withr_2.5.0 tidyselect_1.2.0
[19] gridExtra_2.3 prettyunits_1.1.1 processx_3.8.1
[22] compiler_4.2.3 textshaping_0.3.6 glmnet_4.1-7
[25] cli_3.6.1 rvest_1.0.3 xml2_1.3.3
[28] labeling_0.4.2 scales_1.2.1 ggridges_0.5.4
[31] callr_3.7.3 rappdirs_0.3.3 systemfonts_1.0.4
[34] pbdZMQ_0.3-9 stringr_1.5.0 digest_0.6.31
[37] StanHeaders_2.21.0-7 extraDistr_1.9.1 base64enc_0.1-3
[40] pkgconfig_2.0.3 htmltools_0.5.5 fastmap_1.1.1
[43] rlang_1.1.0 shape_1.4.6 prophet_1.0
[46] generics_0.1.3 farver_2.1.1 jsonlite_1.8.4
[49] dplyr_1.1.2 zip_2.3.0 inline_0.3.19
[52] RCurl_1.98-1.12 magrittr_2.0.3 loo_2.6.0
[55] patchwork_1.1.2 Matrix_1.5-3 Rcpp_1.0.10
[58] IRkernel_1.3.2 munsell_0.5.0 fansi_1.0.4
[61] reticulate_1.28 lifecycle_1.0.3 stringi_1.7.12
[64] pROC_1.18.0 yaml_2.3.7 pkgbuild_1.4.0
[67] plyr_1.8.8 grid_4.2.3 parallel_4.2.3
[70] crayon_1.5.2 lattice_0.20-45 IRdisplay_1.1
[73] splines_4.2.3 lares_5.2.1 ps_1.7.5
[76] pillar_1.9.0 uuid_1.1-0 codetools_0.2-19
[79] stats4_4.2.3 glue_1.6.2 evaluate_0.20
[82] rpart.plot_3.1.1 RcppParallel_5.1.7 png_0.1-8
[85] vctrs_0.6.2 nloptr_2.0.3 gtable_0.3.3
[88] purrr_1.0.1 tidyr_1.3.0 ggplot2_3.4.2
[91] openxlsx_4.2.5.2 h2o_3.40.0.1 ragg_1.2.5
[94] survival_3.5-3 minpack.lm_1.2-3 tibble_3.2.1
[97] iterators_1.0.14 timechange_0.2.0 here_1.0.1
```

laresbernardo commented 1 year ago

Hi @lukmaz You're right about most of what you mentioned. The reason cores = 1 runs correctly is that we turn off parallel computing in that scenario. cores = NULL will use all available cores minus 1. So your JupyterLab config is probably having issues with parallel computing as it is.

> Is it possible to limit the number of cores used in robyn_refresh, similarly as it is possible in robyn_run?

Yes. You can limit the cores (or turn off parallel computing) by setting OutputCollect$cores <- 1 before passing it to robyn_refresh().

lukmaz commented 1 year ago

I don't see how OutputCollect is being passed to robyn_refresh():

```r
robyn_refresh <- function(json_file = NULL,
                          robyn_object = NULL,
                          dt_input = NULL,
                          dt_holidays = Robyn::dt_prophet_holidays,
                          refresh_steps = 4,
                          refresh_mode = "manual",
                          refresh_iters = 1000,
                          refresh_trials = 3,
                          plot_folder = NULL,
                          plot_pareto = TRUE,
                          version_prompt = FALSE,
                          export = TRUE,
                          calibration_input = NULL,
                          ...) {
```

I modified the OutputCollect variable that is being used throughout the script, but it doesn't seem to influence the robyn_refresh() function since it still runs on 3 cores after the modification.

laresbernardo commented 1 year ago

Did you actually try passing cores = 1 within robyn_refresh(...)? Those ... are passed to robyn_run() internally. That's actually the most straightforward way.
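A minimal sketch of that call, assuming the demo datasets and the default refresh arguments from the signature quoted above (the json_file path is a placeholder):

```r
library(Robyn)

RobynRefresh <- robyn_refresh(
  json_file = "RobynModel-x_xxx_x.json",  # placeholder: exported JSON from the initial build
  dt_input = dt_simulated_weekly,         # demo dataset; replace with your own data
  dt_holidays = dt_prophet_holidays,
  refresh_steps = 4,
  refresh_iters = 1000,
  refresh_trials = 3,
  cores = 1                               # forwarded through ... to robyn_run()
)
```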

edavishydro commented 1 year ago

I've been using Robyn on Vertex AI for ~1 year now. I believe the issue has to do with the %dorng% calls messing up the parallel computing in a JupyterLab environment.

The only workaround I've found that works since Robyn 3.6 is the following:

  1. Change any instances of %dorng% in model.R and plots.R to %do% (see the sketch below)
  2. add importFrom(foreach, "%do%") to the NAMESPACE
  3. Re-compile and install Robyn with modifications

I'm sure this isn't the best workaround for reproducibility, because you're getting rid of the ability to assign a seed, but it's worked for me!
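For illustration, the kind of substitution step 1 describes, shown on a toy loop rather than on Robyn's actual model.R code:

```r
library(foreach)
library(doRNG)

# Original style: %dorng% gives each iteration a reproducible parallel RNG stream
# res <- foreach(i = 1:4, .combine = c) %dorng% rnorm(1)

# Workaround style: %do% evaluates sequentially, bypassing the doRNG/fork
# machinery that trips up JupyterLab, at the cost of losing seed control
res <- foreach(i = 1:4, .combine = c) %do% rnorm(1)
```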

laresbernardo commented 1 year ago

Actually, we've changed A LOT since version 3.6: https://github.com/facebookexperimental/Robyn/releases If you guys are up for it, we are open to enabling cloud instances and these kinds of solutions (automatically or via a new parameter). If foreach's %do% works exactly like the current version and also enables you to run on Vertex, we can migrate. If you're willing to develop a solution, test it, and open a PR, we're open to implementing it.

laresbernardo commented 1 year ago

There's also this external post on Medium that can help you guys set up Vertex AI: Marketing Mix Modelling with Robyn on Vertex AI by Olejniczak Lukasz [Customer Engineer at Google Cloud (Smart Analytics & ML)]

lukmaz commented 1 year ago

@laresbernardo, you are right, passing cores = 1 directly to robyn_refresh() works and it resolves the memory issues on Vertex AI, thanks!

I know the Medium post on running Robyn on Vertex AI. Unfortunately it's slightly outdated and does not work out of the box - Vertex AI seems not to accept custom Docker images built on top of the R image. Actually, I talked with Lukasz Olejniczak and he has not run Robyn on Vertex AI since publishing the article, so he is not aware of the issues with the current versions of Robyn and Vertex AI.

andraste commented 1 year ago

Trying to run the demo code on Vertex, I did the following: installed Robyn on Vertex using the instructions from that Medium post, then ran the demo code, which was consistently failing for me too at the OutputCollect <- robyn_outputs(...) step. The solution was commenting out Sys.setenv(R_FUTURE_FORK_ENABLE = "true") and options(future.fork.enable = TRUE), then, as suggested, setting cores = 1 at the robyn_run() step. Thanks for starting this thread!
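A sketch of that combination of changes, using the variable names from demo.R (iteration and trial counts are illustrative):

```r
# Lines in demo.R that enable forked futures; commented out for Vertex AI / JupyterLab:
# Sys.setenv(R_FUTURE_FORK_ENABLE = "true")
# options(future.fork.enable = TRUE)

OutputModels <- robyn_run(
  InputCollect = InputCollect,
  iterations = 2000,   # illustrative values
  trials = 5,
  cores = 1            # disable parallel computing
)
```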

gufengzhou commented 1 year ago

> I've been using Robyn on Vertex AI for ~1 year now. I believe the issue has to do with the %dorng% calls messing up the parallel computing in a JupyterLab environment.
>
> The only workaround I've found that works since Robyn 3.6 is the following:
>
>   1. Change any instances of %dorng% in model.R and plots.R to %do%
>   2. add importFrom(foreach, "%do%") to the NAMESPACE
>   3. Re-compile and install Robyn with modifications
>
> I'm sure this isn't the best workaround for reproducibility, because you're getting rid of the ability to assign a seed, but it's worked for me!

Hi @lukmaz, I'm curious whether you could use multiple cores before, or if this is a recent issue? How is the speed with 1 core compared to before? Also, as Bernardo mentioned, one of the recent improvements is an 88% object size reduction. Could that have solved your memory-to-cores-ratio issue?

lukmaz commented 1 year ago

I hadn't run Robyn on Vertex AI before. I first tried in ~February this year and had memory issues from the beginning. I didn't notice any change after the recent 88% object size reduction. I didn't measure it precisely, but training runs much slower on 1 core, so the parallelism probably works fine (when it runs at all and doesn't crash).