CamDavidsonPilon / Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)
http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
MIT License
26.67k stars 7.86k forks source link

Ch 1: off by 1 error between pymc2 and 3 #338

Closed taylorterry3 closed 6 years ago

taylorterry3 commented 7 years ago

I absolutely love this project, and my team is now using it as a professional development module. We're using pymc3, and got a little puzzled by the last exercise in Ch 1 which asks you to "consider all instances where tausamples < 45" as all of our values of tau are less than 45. Comparing the pymc2 and pymc3 versions the max value of tau is 45 in 2 and 44 in 3, with the distributions appearing the same other than being shifted by 1. Changing the >= to a > in the line `lambda = pm.math.switch(tau >= idx, lambda_1, lambda_2)` in the pymc3 version brings them back into alignment.

Both versions state tau like this:

tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data)

n_count_data is 74, so tau can have any integer value 0-74 inclusive. To reason about this it helps to think of the values of tau as representing the boundaries between the days and not the days themselves, so there is always going to be one more than there are days.

The pymc2 version sets up lambda_ like this:

@pm.deterministic
def lambda_(tau=tau, lambda_1=lambda_1, lambda_2=lambda_2):
    out = np.zeros(n_count_data)
    out[:tau] = lambda_1  # lambda before tau is lambda1
    out[tau:] = lambda_2  # lambda after (and including) tau is lambda2
    return out

While pymc3 does this:

with model:
    idx = np.arange(n_count_data) # Index
    lambda_ = pm.math.switch(tau >= idx, lambda_1, lambda_2)

So when tau is zero, the pymc2 version executes out[:0] which returns an empty array. In real world terms, this is the case where all events happened under the lambda_2 regime. pymc3 by contrast executes pm.math.switch(0 >= idx, lambda_1, lambda_2) and returns lambda_1 for the first element of idx because 0 >= 0 is True. The comparison operator needs to be > so that you can evaluate 0 > 0 and get False for that and all other elements of idx for the case where you're always in the lambda_2 regime.

I think I have all this right, but I'm a bit new to all of this so I wanted to lay out the reasoning before I put in a random PR to change a >= to a >. Thanks for reading, and let me know if this all makes sense.

taylorterry3 commented 7 years ago

To add to this, there also appears to be a small problem with the prior distribution for tau.

My read of the pymc code is that pm.DiscreteUniform() generates values from lower to upper plus one. See https://github.com/pymc-devs/pymc/blob/0c9958838014e2b5693c56ebd4fc32a96632f189/pymc/distributions.py#L2699

So in Ch 1 tau should be:

tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1)

This is consistent with the pymc tutorial on switchpoint analysis at https://pymc-devs.github.io/pymc/tutorial.html, where len(disasters_array) is 111 but upper=110:

switchpoint = DiscreteUniform('switchpoint', lower=0, upper=110, doc='Switchpoint[year]')

The first comment still applies and is the main issue here. This bit is all but harmless to the analysis, I just happened to notice it when comparing this example to the pymc tutorial. Happy to PR both if it's agreed that both are issues.

CamDavidsonPilon commented 7 years ago

Hey @taylorterry3, good catches. This is very astute of you. I think both issues are worth PR-ing. (I can't think of a better name for these bugs then literal edge cases!)

taylorterry3 commented 7 years ago

I see what you did there :)

Will PR when I get a moment.

A slight addendum on the second issue now that I've thought about it during daylight hours. pm.DiscreteUniform() generates values from lower to upper, not upper plus one; the line I linked to with the +1 is just correcting for the fact that np.random.randint(0, n) only yields values up to n-1, which is consistent with the rest of numpy.

However, it's not clear that we want len(n_count_data) + 1 values of tau. While it's true that the switchpoints do reflect boundaries between the days, on further reflection it seems to me that the tau = 0 and tau = 74 switchpoints are actually functionally identical. They both represent all events occurring under the same regime; they only represent something different if the two lambdas have different priors. In this case they have the same prior because they both use the same alpha value.

At any rate, these are extremely edgy cases that bear more on theory than practice. I'm going to go ahead and PR a change to have tau as below sheerly for the sake of consistency with the tutorial in the pymc repo, which has been there for six years without generating objection from nitpickers like me.

tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1)
taylorterry3 commented 7 years ago

Just realized I let this slip. Do you want a one-line PR of direct surgery on the .ipynb file or should I re-run everything in the notebook to reflect the change and go ahead and generate the monster diff?

CamDavidsonPilon commented 7 years ago

good question - ideally it's a one-liner, but that is prone to json errors. Maybe give it a shot?

geoffharcourt commented 6 years ago

I think this is resolved in #353?