facebookexperimental / Robyn

Robyn is an experimental, AI/ML-powered, open-source Marketing Mix Modeling (MMM) package from Meta Marketing Science. Our mission is to democratise modeling knowledge, inspire the industry through innovation, reduce human bias in the modeling process, and build a strong open-source marketing science community.
https://facebookexperimental.github.io/Robyn/
MIT License

Plotting uses huge amount of RAM, causes crash, Out of Memory Errors #858

Open MC-Dave opened 10 months ago

MC-Dave commented 10 months ago

Project Robyn

Describe issue

robyn_outputs is consuming a huge amount of RAM during plotting. We need the CSV outputs but have no use for the plots. The spike happens while robyn_outputs is carrying out plotting (i.e., after it prints "Plotting X selected models on Y cores"), and it is causing failures in our production systems. We have not found any way to disable the plot outputs while preserving the CSV outputs: via the "export" parameter one can only enable both CSV outputs and plotting or disable them both.

Note: we currently limit our instances to 1/5 of the system's total cores. So if a machine has 20 cores, we only use 4.

On a system with 192 GB of RAM and 48 cores, we use 9 cores. During training the process uses only ~2% of available RAM. Right at the end, just after it prints "Plotting X selected models on Y cores", RAM usage jumps to 100% and the execution crashes.
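For context, the core cap described above can be sketched in R. This is an illustrative snippet, not the OP's actual setup: it assumes `robyn_run()`'s `cores` argument and the 5 trials / 5000 iterations mentioned later in this thread.

```r
library(Robyn)

# Cap Robyn at roughly 1/5 of the machine's cores, as described above.
n_cores <- max(1, floor(parallel::detectCores() / 5))

# InputCollect is assumed to come from a prior robyn_inputs() call.
OutputModels <- robyn_run(
  InputCollect = InputCollect,
  cores = n_cores,
  trials = 5,
  iterations = 5000
)
```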

Provide reproducible example

The issue is transient: re-running a failed execution with the exact same inputs and datasets will succeed sometimes and fail other times.

Environment & Robyn version

ROBYN VERSION R@fb3688a9ee9fe3a7836e6fea1ad386080a3fb00c Installed via remotes::install_github("facebookexperimental/Robyn/R@fb3688a9ee9fe3a7836e6fea1ad386080a3fb00c")

R Version 4.3.2

MC-Dave commented 10 months ago

Note: We only encounter this issue when running refresh jobs. When running a full train @ 5 trials, 5000 iterations, we never encounter the error. When running a refresh @ 5 trials, 5000 iterations, the issue occurs intermittently.

ToddMinerTech commented 10 months ago

Same issue here, cannot figure out the pattern of when it succeeds or fails.

gufengzhou commented 10 months ago

Sorry for the late reply. You can set the arg plot_pareto = FALSE in robyn_outputs() to deactivate the PNGs; see ?robyn_outputs. We'll look into the root cause in the future, but probably not very soon.
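For reference, a minimal sketch of the suggested call (argument names follow ?robyn_outputs; the exact values are illustrative and depend on your pipeline):

```r
library(Robyn)

# Keep the CSV exports but skip the memory-heavy one-pager PNGs.
# InputCollect / OutputModels are assumed to come from earlier
# robyn_inputs() / robyn_run() calls.
OutputCollect <- robyn_outputs(
  InputCollect, OutputModels,
  plot_pareto = FALSE,   # deactivate PNG plotting
  csv_out = "pareto",    # still export the pareto CSV files
  export = TRUE          # write outputs to the plot folder
)
```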

richin13 commented 8 months ago

I'm also being affected by this issue.

Worth noting I already use plot_pareto = FALSE when calling Robyn::robyn_refresh @gufengzhou

Logs:

>>> Recreating model 2_131_3
Imported JSON file succesfully: RobynModel-2_131_3.json
>> Running feature engineering...
Input data has 760 days in total: 2020-11-01 to 2022-11-30
Refresh #3 model is built on rolling window of 700 day: 2020-12-13 to 2022-11-12
Rolling window moving forward: 4 days
>>> Calculating response curves for all models' media variables (14)...
Successfully recreated model ID: 2_131_3
>>> Building refresh model #4 in manual mode
>>> New bounds freedom: 0.57%
>> Running feature engineering...
Input data has 760 days in total: 2020-11-01 to 2022-11-30
Refresh #4 model is built on rolling window of 700 day: 2020-12-17 to 2022-11-16
Rolling window moving forward: 4 days
Fitting time series with all available data...
Using geometric adstocking with 53 hyperparameters (52 to iterate + 1 fixed) on 7 cores
>>> Starting 3 trials with 1000 iterations each using TwoPointsDE nevergrad algorithm...
  Running trial 1 of 3

  |
  |======================================================================|  99%

  Finished in 1.09 mins
  Running trial 2 of 3

  |
  |======================================================================|  99%

  Finished in 1.13 mins
  Running trial 3 of 3

  |
  |======================================================================|  99%

  Finished in 1.29 mins
>>> Running Pareto calculations for 3000 models on auto fronts...
Killed

richin13 commented 5 months ago

Hello! Any updates here? This continues to be an issue, and increasing the task's resources is not feasible (the OP is experiencing the same error on a 192 GB RAM system).

gufengzhou commented 5 months ago

I've been using our standard dataset and can't really reproduce this issue, although I've heard from multiple sources that Windows users are seeing it more frequently. The outputs/plotting functions do consume more memory, regardless of refresh. I'm trying to test it on a larger dataset and will report back.

gufengzhou commented 4 months ago

So far I couldn't reproduce the issue. I'm on a Mac M1 Pro. I've just tested with 15 media vars, using weibull for more hyperparameters, ran it at 5k iterations x 4 trials, then refreshed it at 2k iterations x 4 trials. It ran through.

@richin13 what machine/system are you using?

richin13 commented 4 months ago

@gufengzhou we're running in AWS ECS Fargate, but I was able to reproduce it on my local system (Ubuntu Linux 22.04, 16 GB of RAM, and an 11th Gen Intel® Core™ i5-1135G7 @ 2.40GHz × 4).

We're running Robyn 3.10.3, though, and I'm working on upgrading to 3.10.5 to see if that resolves it (maybe you're running 3.10.5, which would confirm the leak is fixed?).

gufengzhou commented 4 months ago

I'm running on the latest 3.10.7. Please try and let me know.

richin13 commented 4 months ago

A bit of a tangent, but are you planning on cutting the 3.10.7 release any time soon? This is a prod system, so I'd be hesitant to install the version on master. We usually rely on whatever is published on the GitHub releases page, as we assume those are considered stable, but I'm not seeing 3.10.7 there.

richin13 commented 4 months ago

@gufengzhou it seems like bumping to 3.10.5 fixes the memory leak, as the process no longer gets killed. However, I'm now getting a different error later in the process:

>>> Calculating clusters for model selection using Pareto fronts...
Couldn't automatically create clusters: Error: empty cluster: try a better set of initial centers
Error in UseMethod("mutate") : 
  no applicable method for 'mutate' applied to an object of class "NULL"
In addition: Warning messages:
1: In robyn_chain(json_file) :
  Can't replicate chain-like results if you don't follow Robyn's chain structure
2: In prophet_decomp(dt_transform, dt_holidays = InputCollect$dt_holidays,  :
  Currently, there's a known issue with prophet that may crash this use case.
 Read more here: https://github.com/facebookexperimental/Robyn/issues/472
3: In hyper_collector(InputCollect, hyper_in = InputCollect$hyperparameters,  :
  Provided train_size but ts_validation = FALSE. Time series validation inactive.
Error in clusterCollect$data : $ operator is invalid for atomic vectors
Calls: main ... same_src -> same_src.data.frame -> is.data.frame -> select
Execution halted

My guess is that, since these models were created with 3.10.3, they can no longer be refreshed using 3.10.5? Is that the case? Is there anything we can do to those models so they can be refreshed using 3.10.5? Thanks

gufengzhou commented 4 months ago

Good to know the memory issue is gone. I'd strongly recommend updating to 3.10.7, which has been stable for most use cases and includes numerous fixes for refresh, incl. the chain error, the ts_validation error, etc. We'll push this version to the GitHub releases page as well as CRAN in a few weeks.
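A sketch of pinning the install to a specific ref, following the same remotes::install_github() pattern used earlier in this thread. The ref shown is an assumption; substitute the actual tag or commit for the version you want:

```r
# install.packages("remotes")  # if not already installed

# Pin the install to a specific ref; "v3.10.7" is illustrative --
# replace it with the actual release tag or commit hash.
remotes::install_github("facebookexperimental/Robyn/R@v3.10.7")

# Verify which version was installed.
packageVersion("Robyn")
```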

@laresbernardo FYI, the latest versions apparently no longer cause the memory issue on AWS.

richin13 commented 4 months ago

Will do, thanks!