facebookexperimental / Robyn

Robyn is an experimental, AI/ML-powered and open sourced Marketing Mix Modeling (MMM) package from Meta Marketing Science. Our mission is to democratise modeling knowledge, inspire the industry through innovation, reduce human bias in the modeling process & build a strong open source marketing science community.
https://facebookexperimental.github.io/Robyn/
MIT License

Unfortunate channel naming can lead to mixed up hyperparameters in budget allocator #819

Open m4x3 opened 9 months ago

m4x3 commented 9 months ago

Issue

When running the budget allocator on our internal data, the gamma / inflexions parameters of some channels got mixed up due to unfortunate variable naming and implicit sorting logic in the code.

Internally, one of our media channels is called fb_value_opt (FB value opt campaign) and another is called fb_value_opt_ads (FB value opt campaign that also optimizes for ad revenue).

In the budget allocator, the hill parameters are fetched from the model results through the get_hill_params() function.

Part of this function calculates the inflexion points. The following code assumes that the columns of chnAdstocked are in the same order as the elements of the gammas vector.

inflexions <- unlist(lapply(seq(ncol(chnAdstocked)), function(i) {
    c(range(chnAdstocked[, i]) %*% c(1 - gammas[i], gammas[i]))
  }))
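As a reading aid, the matrix product in this snippet is a linear interpolation between the minimum and maximum of each adstocked channel, weighted by that channel's gamma. A minimal sketch with made-up numbers (the vector `x` and the value `g` are illustrative and not taken from any model):

```r
# Toy adstocked series and gamma; both are illustrative values.
x <- c(0, 50, 200)
g <- 0.3

# Form used in the snippet: the (min, max) pair times the weights (1 - g, g).
inflexion_matrix <- c(range(x) %*% c(1 - g, g))

# Equivalent explicit form: interpolate between min and max.
inflexion_explicit <- min(x) * (1 - g) + max(x) * g

print(inflexion_matrix)
print(inflexion_explicit)
```

Because the result depends directly on which gamma is paired with which column, any mismatch in ordering changes the inflexion point without raising an error.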

However, as the example below shows, this is not the case when the variables are named as above. In our case, the inflexion points for these campaigns were calculated incorrectly, which had a drastic impact on the budget allocator. No error was raised; we only identified the issue because we independently ran robyn_response() on these channels and got different results.

Provide reproducible example

sort(c("fb_value_opt", "fb_value_opt_ads"))
[1] "fb_value_opt"     "fb_value_opt_ads"
sort(c("fb_value_opt_gammas", "fb_value_opt_ads_gammas"))
[1] "fb_value_opt_ads_gammas" "fb_value_opt_gammas"    
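To make the consequence concrete, here is a small sketch of how the differing sort orders can swap gammas between the two channels. All values are invented for illustration, and the alphabetical sort is applied explicitly here as an assumption about what happens upstream; the sketch does not reproduce Robyn's internal code path.

```r
# Hypothetical adstocked data; column order follows the channel names.
chnAdstocked <- data.frame(
  fb_value_opt     = c(0, 100),
  fb_value_opt_ads = c(0, 1000)
)

# Hypothetical gammas stored under suffixed names.
gammas <- c(fb_value_opt_gammas = 0.2, fb_value_opt_ads_gammas = 0.8)

# Assumption: somewhere upstream the vector is sorted by its suffixed names.
gammas_sorted <- gammas[sort(names(gammas))]

# Positional pairing of column i with gamma i, as in the inflexion calculation.
inflexion_at <- function(g) {
  unlist(lapply(seq(ncol(chnAdstocked)), function(i) {
    c(range(chnAdstocked[, i]) %*% c(1 - g[i], g[i]))
  }))
}

intended <- inflexion_at(gammas)         # gammas in column order
actual   <- inflexion_at(gammas_sorted)  # gammas in alphabetical order

print(rbind(intended, actual))
```

With these toy inputs the two rows differ for both channels, which is the silent mix-up described in this issue.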

Potential fix

In our case, we fixed the issue by reordering the gammas vector inside the get_hill_params() function so that it matches the column order of chnAdstocked. We also reordered the alphas vector to be safe:

...
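# Strip the suffix so the names match the chnAdstocked columns,
# reorder by column name, then restore the suffix.
names(gammas) <- stringr::str_remove(names(gammas), "_gammas$")
gammas <- gammas[names(chnAdstocked)]
names(gammas) <- paste0(names(gammas), "_gammas")

# Same reordering for alphas, as a precaution.
names(alphas) <- stringr::str_remove(names(alphas), "_alphas$")
alphas <- alphas[names(chnAdstocked)]
names(alphas) <- paste0(names(alphas), "_alphas")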

inflexions <- unlist(lapply(seq(ncol(chnAdstocked)), function(i) {
    c(range(chnAdstocked[, i]) %*% c(1 - gammas[i], gammas[i]))
  }))
...

There may be more elegant solutions to this.
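One possible alternative, sketched here under the assumption that the hyperparameter vectors are named `<channel>_gammas` and `<channel>_alphas`, is to look the values up by their full names instead of stripping and restoring the suffix. The helper `align_params()` is hypothetical and not part of Robyn.

```r
# Hypothetical helper: return `params` ordered to match `channels`,
# looking each value up by its full suffixed name.
align_params <- function(params, channels, suffix) {
  wanted <- paste0(channels, suffix)
  missing <- setdiff(wanted, names(params))
  if (length(missing) > 0) {
    # Fail loudly instead of silently pairing the wrong values.
    stop("Missing hyperparameters: ", paste(missing, collapse = ", "))
  }
  params[wanted]
}

# Toy inputs with the problematic naming; values are illustrative.
channels <- c("fb_value_opt", "fb_value_opt_ads")
gammas <- c(fb_value_opt_ads_gammas = 0.8, fb_value_opt_gammas = 0.2)

aligned <- align_params(gammas, channels, "_gammas")
print(aligned)
```

Looking up by name makes the result independent of any earlier sorting, and the explicit check turns a missing or misnamed hyperparameter into an error.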

I know this is probably an edge case caused by our channel naming, but it had quite a drastic impact on our results.

gufengzhou commented 9 months ago

Thanks very much for raising this! We didn't consider this case and will include a fix soon. Have you observed similar sorting issues in the modelling itself? If you run the model with "standard naming" and the same settings, do you get the same results? I'm trying to get a sense of whether I need to check more places for sorting issues.

m4x3 commented 9 months ago

Hi! Thanks for looking into this. I have now run several iterations to check the consistency of the results:

  1. Running the same model with "standard naming", i.e. no varname is a substring of another varname.
  2. Running the same model with "unfortunate naming", i.e. some varnames are substrings of other varnames.
  3. Running the same model with "standard naming" but with minor changes to the initial varnames.

Versions 1 and 2 produce different model results. This indicates that the sorting issue also occurs somewhere in the modeling and/or output generation process.

Versions 1 and 3 produce identical results. This was expected, but I wanted to rule out that changing names in general can produce different results.

For now, I will continue using standard names to rule out any issues for my project.
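As a precaution before modeling, variable names can be screened for the "unfortunate naming" pattern described above, where one varname is a substring of another. The function `find_nested_names()` below is a hypothetical helper written for this thread, not a Robyn function.

```r
# Hypothetical helper: list every pair where one name is contained in another.
find_nested_names <- function(varnames) {
  hits <- list()
  for (a in varnames) {
    for (b in varnames) {
      # fixed = TRUE treats `a` as a literal string, not a regular expression.
      if (a != b && grepl(a, b, fixed = TRUE)) {
        hits[[length(hits) + 1]] <- data.frame(
          short = a, long = b, stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(hits) == 0) {
    return(data.frame(short = character(0), long = character(0),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, hits)
}

# Illustrative variable names; "tv" is a made-up extra channel.
nested <- find_nested_names(c("fb_value_opt", "fb_value_opt_ads", "tv"))
print(nested)
```

An empty result means no name is contained in another, which corresponds to the "standard naming" used in versions 1 and 3.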