Hey, this is very strange. I can't think of any possible cause off the top of my head. We'll need a reproducible example to replicate this error, including a dataset (true values masked) and your demo.R file for the model specification. If you don't want to share publicly, please send it via email to @laresbernardo and me (bernardolares@fb.com, gufeng@fb.com)
How should I mask the data? By true values do you mean features and target?
Users usually don't share real data with us, but rather a randomised dataset.
OK, sent it to your email.
Hi @ohad-monday
I've just run your example exactly as you sent it to us (with fewer iterations and trials) and it ran OK. I'm attaching the R file I used with your demo.csv (not included). Please be sure to update Robyn to the latest version (3.6.0, released today) and try again.
Note that now we set iterations and trials within robyn_run()
(you'll get a warning).
Do let us know if it works for you after updating.
issue_307.R.zip
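For reference, a minimal sketch of the new call pattern, assuming Robyn 3.6.0+ as described above (argument values are placeholders):

# Iterations and trials are now set within robyn_run(), per the note above.
OutputModels <- robyn_run(
  InputCollect = InputCollect, # object returned by robyn_inputs()
  iterations = 2000,           # iterations per trial
  trials = 5                   # independent Nevergrad trials
)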
@laresbernardo thank you, I'm checking right now. One thing though: did you change something in the modeling part that could cause very poor model accuracy? I'm running it from the beginning of the pipeline, and the fitted models on the same data are terrible. Did you change/add any parameter I need to add/change when calling robyn_run?
Hey, yes, there are actually major improvements in the optimisation, see here. Please also check out the new demo.R guide that introduces new functionalities and workflows. You should also be seeing a new convergence message after the model runs. How many iterations are you running? For the simulated dataset it now converges at 1.5-2k iterations.
New version: I ran it with 2500 iterations... something is completely off.
Refresh bug: I ran it again with the new version and am still getting the same error:
Finished in 4.04 mins
Running Pareto calculations for 3000 models on 3 fronts...
Error in RNGseq(n, seed, ..., version = if (checkRNGversion("1.4") >= :
  NMF::createStream - invalid value for 'n' [positive value expected]
Calls: robyn_refresh ... robyn_pareto -> %dorng% -> do.call -> doRNGseq -> RNGseq
In addition: Warning message:
In check_calibconstr(calibration_constraint, OutputModels$iterations, :
  calibration_constraint set for top 10% calibrated models. 300 models left for pareto-optimal selection. Minimum suggested:
Can you share this plot with us: OutputCollect$OutputModels$convergence$moo_distrb_plot
@gufengzhou I don't see this plot here:
@ohad-monday can you please check on OutputModels$convergence$moo_distrb_plot instead? Is that the output of a robyn_refresh() or a robyn_run()?
@laresbernardo sure! (it's from the robyn_run()) here it is:
Thanks for sharing. As you can see, the model hasn't converged; compare this with the example plot here. In particular, NRMSE hasn't moved from the right side, which reflects exactly the bad fitting plot you showed before. I recommend running more iterations: you could do, for example, 5k with 1 trial first to see if 5k converges. The reason for this change in convergence speed is that we've added lambda as an extra hyperparameter to enable automatic selection of its optimum. We'll fine-tune this over time to improve convergence speed.
FYI I've just committed a fix that should accelerate convergence. You should be seeing the "hills" in the plot moving left earlier compared to before. Let us know if it works.
@gufengzhou I found the reason you weren't able to reproduce the error I got. The problem arises when a model assigns a 0 coefficient to a channel that also has calibration data. Might it be somehow related to the MAPE lift calculation for the Nevergrad optimization?
Uhh, bummer. Would you be able to send me (laresbernardo @gmail.com) a CSV with anonymized data and the .R file you're using to run Robyn so I can replicate your issue exactly? It'd be really useful for debugging this error, in case you're not able to fix it yourself and make a pull request. I'd be happy to check, given that Gufeng will be on leave for some months.
@laresbernardo thanks! I shared an updated params.R with calibration data by email. The refresh is supposed to end with an error. Then, if you exclude the f6 feature from the calibration data, it will work.
Hi @laresbernardo @kyletgoldberg I also use the new version of Robyn, and found the calibration results very poor. My actual vs. predicted result looks very similar to this. I tried increasing iterations to 5000-7000, but it still won't converge. What's even stranger to me is that the trained R2 is negative. Is there any update on this issue?
Correct me if I'm wrong; my understanding is that the previous version 3.5.0 chose the best lambda for the ridge regression using 10-fold CV on the window period. What has changed in the method used to tune lambda in version 3.6.0, and why did you decide to make the change?
Thanks very much for your help! I'm really learning a lot from your product development and discussion here!
@JiaMeihong could you check whether any of the calibration data you are using corresponds to a channel that was assigned a 0 coefficient? It seems like that was causing the issue for @ohad-monday, so that would be a good place to start.
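One way to run that check, as a rough sketch: this assumes the pareto_aggregated.csv export has a variable-name column rn and a coef column (adjust to the actual column names in your version), and that calibration_input$channel holds the calibrated channel names.

library(dplyr)

# Channels whose coefficient is exactly zero in the aggregated Pareto results
pareto <- read.csv("pareto_aggregated.csv")
zero_coef_channels <- pareto %>%
  filter(coef == 0) %>%
  distinct(rn) %>%
  pull(rn)

# Any overlap with calibrated channels is a candidate cause of the error
intersect(zero_coef_channels, unique(calibration_input$channel))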
@kyletgoldberg Hi, thanks for your reply! I've checked the coefficients in "pareto_aggregated.csv". The channels of interest don't have a 0 coefficient. Even though the results are still far from convergence after the iterations, all channels in my model results have positive coefficients.
Thanks for taking a look - how many different calibration inputs are you using? This is a difficult one to replicate on our end since we don't have data to recreate it with, but if you don't have too many calibrations, would it be possible to take them out one at a time and see if one particular calibration input is causing the issue? I suspect some interaction between the calibration and the lambda hyperparameter optimization is the cause.
@kyletgoldberg Thanks for your suggestion! I've tried inputting just one medium and one testing period into the calibration, but the result is still not working out.
My calibration data is by month, but my input data is by week. Does that possibly impact the result?
Could you briefly explain what has changed about calibration in Robyn version 3.6.0? When using the previous version with calibration, the results worked fine. I think understanding the logic of the change may help me track down the bug further.
Thanks!
@JiaMeihong That shouldn't be an issue - nothing changed in 3.6 with respect to calibration, but in 3.6 we added lambda as a hyperparameter to the Nevergrad optimization rather than using the CV method. This should lead to better results by allowing more flexibility in learning the hyperparameters, but it also seems to be causing some convergence issues when paired with calibration data at times.
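To illustrate the difference, here is a rough sketch of the 10-fold CV lambda selection that the earlier approach is described as using; this is not Robyn's internal code, and X/y are placeholder data.

library(glmnet)

# Pre-3.6 idea (as described above): pick the ridge penalty lambda by
# 10-fold cross-validation on the modeling window.
set.seed(123)
X <- matrix(rnorm(52 * 5), ncol = 5)  # e.g. 52 weeks, 5 transformed media/context variables
y <- rnorm(52)                        # placeholder response

cv_fit <- cv.glmnet(X, y, alpha = 0, nfolds = 10)  # alpha = 0 -> ridge regression
best_lambda <- cv_fit$lambda.min

# In 3.6.0+, lambda is instead exposed to Nevergrad as one more hyperparameter
# and searched jointly with the adstock/saturation hyperparameters.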
It seems like what is essentially happening here is that we are running into cases where the calibration data is so at odds with how the model wants to fit that it is never going to converge, which is a difficult problem to solve.
I saw you had edited a comment that had initially said that removing one of the calibrations with the lowest lift had got the model to work again - was that the case? It could be worth digging into how exactly the test was set up vs. how the data is collected for the channel - i.e. does the test encompass all of the spend you are measuring, etc. - to ensure that they are as aligned as possible. It would be great to understand any info you have on this part to keep working on figuring this out. Thanks for your patience!
@kyletgoldberg Thanks very much for your detailed explanation! For the lambda part, I understand the version difference now. So sorry for the confusion caused by my previous comment (already deleted); it turns out I had not included calibration properly, so please ignore it.
What I've found so far is: yes, there's a convergence issue when I try to add calibration; even just one channel with one testing period won't converge in my case. I also tried another model without calibration but with a lot more media channels, and that case hasn't converged after many iterations either.
I ran into the headache of a dying kernel when running too many iterations, so I feel that simply increasing iterations in the hope of achieving convergence may not be a good idea. I would very much appreciate any other workarounds to reach convergence.
@JiaMeihong do you mind sharing what you get when you run the following code after the non-convergent models in both calibration and non-calibration cases?
OutputModels$convergence$moo_distrb_plot
OutputModels$convergence$moo_cloud_plot
How many media channels are you including in that second case? We generally recommend having at least 10 observations per independent variable, so that may also be adding some difficulty. Thanks again for your patience.
@kyletgoldberg Hi, please find the plot below. With 2500 iterations and 10 trials, no metric has converged.
How many media channels are you including in that second case?
I have added 7 channels in total, and each has more than one year of weekly data, though I specify the modeling window so that only the most recent year is used.
@JiaMeihong could you also run that without the calibration and share how it looks?
@JiaMeihong can you give us a bit more context on your calibration inputs? Are those experiments measuring the same KPI you are modeling (dep_var)? Is the spend on that experiment similar to the spend of that media channel date range? How confident are you of the incremental results measured?
@laresbernardo Thanks for your reply!!
Are those experiments measuring the same KPI you are modeling (dep_var)?
Yes
Is the spend on that experiment similar to the spend of that media channel date range?
There's fluctuation in spending, but I tried to transform the AB test result by multiplying it by the ratio of average spending to actual spending.
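A minimal sketch of that adjustment with hypothetical numbers (this is my reading of the approach described above, not a Robyn feature):

# Scale the measured AB-test lift so it reflects the channel's average spend
# level rather than the spend during the test period.
avg_spend     <- 10000  # hypothetical average weekly spend in the modeling window
test_spend    <- 8000   # hypothetical spend during the AB test period
measured_lift <- 500    # hypothetical incremental KPI measured by the test

lift_abs <- measured_lift * (avg_spend / test_spend)  # value used as liftAbs
lift_abs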
How confident are you of the incremental results measured?
Most of the results are from AB tests, so they should be trustworthy.
Even without calibration, another model where I include more channels (7 in total) fails to converge as well.
If we run a model to convergence, does the Pareto Front determination exclude solutions which are pre-convergence?
Hi @extrospective I guess the answer would be no, because we only compare the last 5% of models against the first ones, regardless of the number of iterations. You could reverse-engineer the values calculated for the last quantile to compare with previous models and check "when" you converged, but I think that doesn't make much sense because these values are all relative. Additionally, we only exclude solutions (models to be considered) that are NOT in the Pareto front(s), regardless of convergence. We don't have a pre- and post-convergence reference.
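As a generic illustration of that idea (not Robyn's internal code), comparing the error distribution of the last 5% of iterations against the earliest 5% looks roughly like this:

# Placeholder per-iteration NRMSE values; in practice these would come from
# the model's iteration results.
nrmse <- runif(6000)

n_slice <- ceiling(length(nrmse) * 0.05)
first   <- head(nrmse, n_slice)
last    <- tail(nrmse, n_slice)

# A positive value means the error distribution has moved down over the run.
median(first) - median(last)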
We are running Robyn 3.6.2
We had been using with calibration successfully, but we changed our target variable and are now encountering exactly the error mentioned here.
Error in RNGseq(n, seed, ..., version = if (checkRNGversion("1.4") >= :
NMF::createStream - invalid value for 'n' [positive value expected]
If there is any wisdom on what causes that we would appreciate it, as we're on a tight deadline to turn around this model.
At first I thought it was that we had too few iterations, so we increased to 1 trial x 6000 iterations.
Here is stdout. As you can see, we did not run the 20,000 total iterations suggested, partly because this was already 6 hours and we just wanted to see if the code ran before scaling up.
[1] "robyn_run started"
Warning in check_iteration(InputCollect$calibration_input, iterations, trials, :
You are calibrating MMM. We recommend to run at least 2000 iterations per trial and 10 trials to build initial model
Input data has 4411 days in total: 2010-02-01 to 2022-02-28
Initial model is built on rolling window of 365 day: 2021-03-01 to 2022-02-28
Using geometric adstocking with 40 hyperparameters (40 to iterate + 0 fixed) on 16 cores
>>> Starting 1 trials with 6000 iterations each with calibration using TwoPointsDE nevergrad algorithm...
Running trial 1 of 1
(progress bar output from 0% to 100% omitted)
Finished in 385.25 mins
Using robyn object location: output
Provided 'plot_folder' doesn't exist. Using default 'plot_folder = getwd()': /dbfs/robyn_output/poc/order_count_new/US/2022-02-28
>>> Running Pareto calculations for 6000 models on 3 fronts...
@extrospective could you share your session info please? I've seen this error once before when a user was running on GCP. Are you running locally or on a cloud-based solution?
We are running in Databricks, Databricks Runtime 10.2.
Here is the R version info:
$platform
[1] "x86_64-pc-linux-gnu"
$arch
[1] "x86_64"
$os
[1] "linux-gnu"
$system
[1] "x86_64, linux-gnu"
$status
[1] ""
$major
[1] "4"
$minor
[1] "1.2"
$year
[1] "2021"
$month
[1] "11"
$day
[1] "01"
$`svn rev`
[1] "81115"
$language
[1] "R"
$version.string
[1] "R version 4.1.2 (2021-11-01)"
$nickname
[1] "Bird Hippie"
We pulled 3.6.2 from the git repo roughly 2 weeks ago; checking now for details.
R 4.1.2 is the maximum version available in Databricks runtimes at this time.
@extrospective would it be possible to try running locally to see if you get the same error? I don't think you need to run as many trials; we should be able to see whether it works or not with fewer. I suspect this may be an issue with Robyn having some trouble on cloud platforms, so that would be a good one to rule out if we can.
We will try some further tests. Not sure why "cloud platforms" should be a source of error, but since cloud platforms may have different versions of libraries, I think a library comparison between what works for you and what does not work for us would be helpful.
Our next runs include:
Good point. sessionInfo() should provide an output of all the libraries and their current versions; if you could share that, I can compare it with what's working on our machines while you're working through those other runs.
I am not sure the April 12 commit is in our copy. The prior commits should be in our copy of Robyn.
This is the sessionInfo() for our run. We especially wanted to check rngtools and doRNG versions, but these seem okay. So we are going to investigate data or logic on our end.
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Robyn_3.6.2 reticulate_1.24 testit_0.13 SparkR_3.2.0
[5] scales_1.1.1 patchwork_1.1.1 ggplot2_3.3.5 stringr_1.4.0
[9] data.table_1.14.2 R.utils_2.11.0 R.oo_1.24.0 R.methodsS3_1.8.1
[13] readr_2.1.0 dplyr_1.0.7 plyr_1.8.7 rlist_0.4.6.2
loaded via a namespace (and not attached):
[1] httr_1.4.2 tidyr_1.2.0 jsonlite_1.8.0 splines_4.1.2
[5] foreach_1.5.2 here_1.0.1 RcppParallel_5.1.5 assertthat_0.2.1
[9] lares_5.1.2 doRNG_1.8.2 yaml_2.3.5 pillar_1.6.4
[13] lattice_0.20-45 glue_1.5.0 pROC_1.18.0 digest_0.6.28
[17] rvest_1.0.2 colorspace_2.0-3 htmltools_0.5.2 Matrix_1.3-4
[21] pkgconfig_2.0.3 purrr_0.3.4 openxlsx_4.2.5 TeachingDemos_2.10
[25] tzdb_0.2.0 tibble_3.1.6 generics_0.1.1 ellipsis_0.3.2
[29] withr_2.5.0 lazyeval_0.2.2 survival_3.2-13 magrittr_2.0.1
[33] crayon_1.5.1 Rserve_1.8-10 fansi_0.5.0 doParallel_1.0.17
[37] xml2_1.3.3 hwriter_1.3.2 tools_4.1.2 hms_1.1.1
[41] minpack.lm_1.2-1 lifecycle_1.0.1 rpart.plot_3.1.0 munsell_0.5.0
[45] glmnet_4.1-3 prophet_1.0 rngtools_1.5.2 zip_2.2.0
[49] compiler_4.1.2 rlang_0.4.12 grid_4.1.2 RCurl_1.98-1.6
[53] nloptr_2.0.0 ggridges_0.5.3 iterators_1.0.14 rPref_1.3
[57] rappdirs_0.3.3 igraph_1.3.0 bitops_1.0-7 gtable_0.3.0
[61] codetools_0.2-18 DBI_1.1.1 R6_2.5.1 hwriterPlus_1.0-3
[65] lubridate_1.8.0 fastmap_1.1.0 utf8_1.2.2 rprojroot_2.0.3
[69] h2o_3.36.0.4 shape_1.4.6 stringi_1.7.6 parallel_4.1.2
[73] Rcpp_1.0.8.3 vctrs_0.3.8 rpart_4.1-15 png_0.1-7
[77] tidyselect_1.1.1
One change made on our end since a prior run was the addition of SparkR, so we will also test whether that library is the source of the issue. --> We were wrong about this; SparkR has been in the session all along (see next comment).
Databricks includes SparkR in an empty notebook by default. I note that there are some name conflicts between SparkR and the tidyverse.
We showed that if we turn calibration off, we do not encounter the error mentioned above. We were unable to easily remove SparkR from the Databricks test. We will further examine whether anything about our data or its calibration contributed to the error.
[From this research we have learned about the libraries doRNG and rngtools, which can be used to trace through errors (for those who encounter this error in the future and want to investigate more rapidly).]
I suspect there is an error in liftAbs in the calibration data -> found some NAs.
I would have assumed robyn_run() would notice this earlier than at the final Pareto generation step, since it's an input to the objective function. Can checks for NAs be added?
But first, I'll verify this is the source of the problem.
We have confirmed that this error was a data issue for us. liftAbs was NA for every row in the data that triggered the error, and once that was corrected the problem did not occur.
Based on this, I might recommend adding an assertion that the calibration input's liftAbs values contain no NAs. This assertion would then avoid an unusual and confusing error.
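A minimal sketch of such a check, assuming calibration_input is the data.frame passed into Robyn and that it has the liftAbs column referenced above:

# Fail fast on NA lift values instead of erroring much later in the Pareto step.
if (anyNA(calibration_input$liftAbs)) {
  stop("calibration_input$liftAbs contains NA values; please fix them before running robyn_run()")
}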
And then I think this ticket might be closed with a hypothesis that this same issue caused the other errors reported with this same symptom.
Thanks for confirming @extrospective - we will add a check for that and then close the issue out with that fix.
Thanks for the feedback and for checking the source of this issue @extrospective. Your recommendation has been implemented! Feel free to close this ticket if you consider it fixed.
I do not have the close-ticket option available to me. I recommend closing.
Project Robyn
Describe issue
Issue: when running robyn_refresh with more than 22 incremental days I get the error below.
error message:
Environment & Robyn version
R version: R version 4.0.3 (2020-10-10); Robyn version: Robyn_3.4.8