AllenDowney / ThinkBayes2

Text and code for the forthcoming second edition of Think Bayes, by Allen Downey.
http://allendowney.github.io/ThinkBayes2/
MIT License

Key result in chapter 10 sensitive to jittering #9

Open alexklibisz opened 6 years ago

alexklibisz commented 6 years ago

This issue pertains to Chapter 10 and its source code in variability.py, which estimates posterior distributions for the mean and standard deviation of male and female heights, then uses them to compute distributions of the coefficient of variation (CV) for each group. A key result seems to be that the CV for females is higher than that for males. However, if you remove the jittering that gets applied to the original heights, this result is reversed.

variability.py line 462 applies "jittering" to the list of heights.
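
For context, here is a minimal sketch of what jittering and the CV computation look like. This is not the code from variability.py: the helper names, the uniform noise, the width of 0.5, and the sample heights are all assumptions made up for illustration.

```python
import random
import statistics


def jitter(values, width=0.5, seed=None):
    """Add uniform noise in [-width, +width] to each value.

    Hypothetical helper; the noise type and width are assumptions,
    not taken from variability.py.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


# Made-up heights, rounded to whole centimeters.
heights = [160, 163, 165, 165, 168, 170, 170, 173, 175, 178]
jittered = jitter(heights, width=0.5, seed=1)

print("raw CV     ", coefficient_of_variation(heights))
print("jittered CV", coefficient_of_variation(jittered))
```

Comparing the two printed values for each group is the same kind of check as the runs below, just without the Bayesian estimation step.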

I also modified line 266 to print a label next to each posterior mean.

If you run the script with jittering, you see that the coefficient of variation for females is greater than that of males, which matches the book's result.

$ python variability.py
...
female CV posterior mean 0.04379422911488041
male CV posterior mean 0.04151490569938492
...
female bigger 1.0000000000000628
male bigger 0

The resulting plot also matches the one in the book:

[image: resulting plot, with jittering]

Now if you comment out line 462 (the jittering) and re-run the script, you see that the mean coefficient of variation is noticeably higher for males.

$ python variability.py
...
male CV posterior mean 0.042135070189436574
female CV posterior mean 0.039877437544664336
...
female bigger 0
male bigger 1.0000000000000615

The resulting plot reflects this result.

[image: resulting plot, without jittering]

My instinct is to trust the second result, as it uses the data in its raw form. Still, it would be nice to understand how this simple jittering can cause such a drastic difference in the coefficient of variation.

I'll post back if I can think of any solution or explanation to this problem.

AllenDowney commented 6 years ago

Interesting. I will investigate as soon as I can, but it might be a little while.

Both distributions have some strange outliers, which have a disproportionate effect on the estimated CV. I might investigate whether something is going on there.
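
A minimal sketch of why outliers matter here, using made-up numbers rather than the actual height data: a single implausible value changes the plain standard-deviation-over-mean CV substantially. The helper name and the sample values are assumptions for illustration only.

```python
import statistics


def cv(values):
    """Coefficient of variation: population standard deviation / mean."""
    return statistics.pstdev(values) / statistics.mean(values)


# Made-up heights in centimeters (not the real data set).
heights = [160, 163, 165, 165, 168, 170, 170, 173, 175, 178]

# The same sample plus one implausible value.
with_outlier = heights + [60]

print("CV without outlier", cv(heights))
print("CV with outlier   ", cv(with_outlier))
```

The outlier both lowers the mean and inflates the standard deviation, so the CV moves in the same direction on both counts.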

Thanks for raising the issue.

manujchandra commented 5 years ago

Hi,

Talking about jittering, why do we jitter in the first place? What is the use of jittering?