Matt52 / bayesian-testing

Bayesian A/B testing

Results are different from online tool #19

Closed. ThomasMeissnerDS closed this issue 1 year ago.

ThomasMeissnerDS commented 1 year ago

Hi,

I tested your library and cross-checked the results against this online calculator. Here is the result from your library:

[{'variant': 'True True True False False False False',
  'totals': 1172,
  'positives': 461,
  'positive_rate': 0.39334,
  'prob_being_best': 0.7422,
  'expected_loss': 0.0582635},
 {'variant': 'False True True False False False False',
  'totals': 222,
  'positives': 27,
  'positive_rate': 0.12162,
  'prob_being_best': 0.0,
  'expected_loss': 0.3280173},
 {'variant': 'False False True False False False False',
  'totals': 1363,
  'positives': 63,
  'positive_rate': 0.04622,
  'prob_being_best': 0.0,
  'expected_loss': 0.4051768},
 {'variant': 'False False False False False False False',
  'totals': 1052,
  'positives': 0,
  'positive_rate': 0.0,
  'prob_being_best': 0.0,
  'expected_loss': 0.4512031},
 {'variant': 'True False True False False False False',
  'totals': 1,
  'positives': 0,
  'positive_rate': 0.0,
  'prob_being_best': 0.2578,
  'expected_loss': 0.1997566}]

So the best variant has a 74% probability of being the winner. The online calculator gives 63.48% instead (and 36.52% instead of 25.78% for the last variant).

I used BinaryDataTest() without specifying any priors.

I did not dig deeper into which result is correct here, but wanted to drop this as feedback.
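
For reference, the test was set up essentially like this (reconstructed from the output above, using the default priors):

from bayesian_testing.experiments import BinaryDataTest

# totals and positives taken from the results printed above
test = BinaryDataTest()
test.add_variant_data_agg("True True True False False False False", totals=1172, positives=461)
test.add_variant_data_agg("False True True False False False False", totals=222, positives=27)
test.add_variant_data_agg("False False True False False False False", totals=1363, positives=63)
test.add_variant_data_agg("False False False False False False False", totals=1052, positives=0)
test.add_variant_data_agg("True False True False False False False", totals=1, positives=0)
test.evaluate()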

ThomasMeissnerDS commented 1 year ago

This difference only occurs in cases with more than 2 groups.

Matt52 commented 1 year ago

Hey, thanks for the question. The first explanation that comes to mind is a difference in priors. Your last variant 'True False True False False False False' (interesting naming, btw 🙂) has only 1 observation, so the prior effect will be very strong there, and honestly I have no idea what prior setup the calculator you shared is using. There is an option to set a custom prior for each variant when adding data (using the a_prior and b_prior parameters), so you can try to tune it to get results similar to that online calculator (though it might be difficult, as their code is likely not visible). You will also see that with more data the difference becomes very small (as the prior effect is reduced). See for instance a test like this:

from bayesian_testing.experiments import BinaryDataTest
test = BinaryDataTest()
test.add_variant_data_agg("A", totals=1172, positives=461)
test.add_variant_data_agg("B", totals=222, positives=27)
test.add_variant_data_agg("C", totals=1363, positives=63)
test.add_variant_data_agg("D", totals=1052, positives=0)
test.add_variant_data_agg("E", totals=100, positives=40)
test.evaluate()

abrunner94 commented 1 year ago

I think that site mentions they use 300000 sample simulations. Have you tried adjusting it?

Matt52 commented 1 year ago

Yes, you can run it with more simulations:

test.evaluate(sim_count=300000)

But I still think the difference is caused by the priors. In this package, the default prior for this test is Beta(a=0.5, b=0.5) (which corresponds to the non-informative Jeffreys prior).
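
To see why the prior matters so much for a variant with only 1 observation, here is a small back-of-the-envelope sketch using the standard conjugate Beta update (just the underlying math, not package code):

# posterior = Beta(a_prior + positives, b_prior + totals - positives)
def posterior_mean(totals, positives, a_prior, b_prior):
    a = a_prior + positives
    b = b_prior + totals - positives
    return a / (a + b)

# the variant with totals=1, positives=0 is dominated by the prior
print(posterior_mean(1, 0, a_prior=0.5, b_prior=0.5))  # 0.25 with the Jeffreys prior
print(posterior_mean(1, 0, a_prior=1, b_prior=1))      # ~0.333 with the uniform prior

# variant A with totals=1172, positives=461 barely notices the prior
print(posterior_mean(1172, 461, a_prior=0.5, b_prior=0.5))  # ~0.3934
print(posterior_mean(1172, 461, a_prior=1, b_prior=1))      # ~0.3935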

If I take the example from @ThomasMeissnerDS and set the priors to a=b=1 (i.e. uniform distribution priors) like this:

from bayesian_testing.experiments import BinaryDataTest
test = BinaryDataTest()
test.add_variant_data_agg("A", totals=1172, positives=461, a_prior=1, b_prior=1)
test.add_variant_data_agg("B", totals=222, positives=27, a_prior=1, b_prior=1)
test.add_variant_data_agg("C", totals=1363, positives=63, a_prior=1, b_prior=1)
test.add_variant_data_agg("D", totals=1052, positives=0, a_prior=1, b_prior=1)
test.add_variant_data_agg("E", totals=1, positives=0, a_prior=1, b_prior=1)
test.evaluate(sim_count=300000)

Then the result is closer to the calculator:

[{'variant': 'A',
  'totals': 1172,
  'positives': 461,
  'positive_rate': 0.39334,
  'prob_being_best': 0.6284,
  'expected_loss': 0.075503},
 {'variant': 'B',
  'totals': 222,
  'positives': 27,
  'positive_rate': 0.12162,
  'prob_being_best': 0.0,
  'expected_loss': 0.3441693},
 {'variant': 'C',
  'totals': 1363,
  'positives': 63,
  'positive_rate': 0.04622,
  'prob_being_best': 0.0,
  'expected_loss': 0.4221505},
 {'variant': 'D',
  'totals': 1052,
  'positives': 0,
  'positive_rate': 0.0,
  'prob_being_best': 0.0,
  'expected_loss': 0.4680319},
 {'variant': 'E',
  'totals': 1,
  'positives': 0,
  'positive_rate': 0.0,
  'prob_being_best': 0.3716,
  'expected_loss': 0.1339427}
]

So I suppose the prior distribution for conversion in the online calculator is Beta(1,1) (i.e. uniform distribution).
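
As a rough sanity check (back-of-the-envelope, assuming scipy is available; this is not part of the package), the prob_being_best values for the single-observation variant can be approximated by treating variant A as fixed at its observed rate of about 0.393, since all the other variants are far behind:

from scipy.stats import beta

p_a = 461 / 1172  # variant A's observed rate; its posterior is tightly concentrated here

# P(single-observation variant beats A): its posterior after totals=1, positives=0 is Beta(a_prior, b_prior + 1)
print(beta.sf(p_a, 0.5, 1.5))  # ~0.26 with the Jeffreys prior (simulation above: 0.2578)
print(beta.sf(p_a, 1.0, 2.0))  # ~0.37 with the uniform prior (simulation above: 0.3716)

Both numbers line up well with the simulated values, which supports the Beta(1,1) guess.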

PlatosTwin commented 1 year ago

@abrunner94 and @ThomasMeissnerDS, I've forked this package, added a fair number of features, and rebranded and released it as bayes_ab. I just want to corroborate the observations above: using a greater number of samples and the Bayes-Laplace prior, Beta(1, 1), does indeed seem to generate results that very closely match those of the online calculator.

One issue not mentioned in the above exchange, but addressed in bayes_ab, is that the positive rate here is calculated incorrectly, or at least misleadingly. The positive rate most useful to the end user, I think, is the mean of the posterior distribution, whereas the positive rate in v0.3.0 (above) is simply conversions divided by totals.
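
For illustration (a sketch of the idea, not the actual bayes_ab code), a posterior-mean positive rate with a Beta(1, 1) prior would look like this:

# mean of the Beta(a_prior + positives, b_prior + totals - positives) posterior,
# as opposed to the raw positives / totals reported as positive_rate in v0.3.0
def posterior_mean_rate(totals, positives, a_prior=1, b_prior=1):
    return (a_prior + positives) / (a_prior + b_prior + totals)

print(posterior_mean_rate(1, 0))       # ~0.333 rather than a raw rate of 0.0
print(posterior_mean_rate(1172, 461))  # ~0.3935, nearly identical to the raw 0.3933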

Matt52 commented 1 year ago

@PlatosTwin thanks for your comments. I am happy to see that you picked up this little project and used it as a base for yours.

Regarding the positive rate, I would not necessarily agree that it is incorrect; it is rather an implementation decision. The idea was just to show something like the actual "conversion rate" (which is very commonly looked at together with these probabilities). I call it "positive rate" to sound more generic (as this package is not designed only for typical conversion-rate A/B testing).