AllenDowney / ThinkBayes2

Text and code for the forthcoming second edition of Think Bayes, by Allen Downey.
http://allendowney.github.io/ThinkBayes2/
MIT License

Chapter 20 Counting Cells Measurement Error Specification #54

Closed: gbrunkhorst closed this issue 2 years ago

gbrunkhorst commented 2 years ago

I love your work and own a hard copy of Think Bayes 2! I know you got the cell counting example from another fantastic content creator. I have been pondering inferred vs. known error in my models and I think that the pymc cell counting model can be improved for posterity. Mainly, I think that there are a bunch of parameters that are inferred in pymc3 that should be fixed. For example:

import pymc3 as pm

billion = 1e9  # scale constant, as defined earlier in the chapter

with pm.Model() as model:
    yeast_conc = pm.Normal("yeast conc", 
                           mu=2 * billion, sd=0.4 * billion)

    shaker1_vol = pm.Normal("shaker1 vol", 
                               mu=9.0, sd=0.05)
    shaker2_vol = pm.Normal("shaker2 vol", 
                               mu=9.0, sd=0.05)
    shaker3_vol = pm.Normal("shaker3 vol", 
                               mu=9.0, sd=0.05)

should be:

import pymc3 as pm
import theano

billion = 1e9  # scale constant, as defined earlier in the chapter

shaker_vol_mu = theano.shared(9.0)
shaker_vol_sd = theano.shared(0.05)

with pm.Model() as model:
    yeast_conc = pm.Normal("yeast conc", 
                           mu=2 * billion, sd=0.4 * billion)

    shaker1_vol = pm.Normal("shaker1 vol", 
                               mu=shaker_vol_mu, sd=shaker_vol_sd)
    shaker2_vol = pm.Normal("shaker2 vol", 
                               mu=shaker_vol_mu, sd=shaker_vol_sd)
    shaker3_vol = pm.Normal("shaker3 vol", 
                               mu=shaker_vol_mu, sd=shaker_vol_sd)

I'm interpreting the mean and sd of yeast conc as prior values to be inferred, and the mean and sd of the shaker volumes as fixed values based on information from outside of the model inference (e.g., the values listed on the packaging of a pipette or something). I think this is also consistent with the ABC portion of the analysis presented just after. Other lines of code that would be added include the following:

shaker_transfer_mu = theano.shared(1.0)
shaker_transfer_sd = theano.shared(0.01)

chamber_vol_mu = theano.shared(0.0001)
chamber_vol_sd = theano.shared(0.0001/20)

The associated pm.Normal() calls would be updated accordingly.

My motivation for writing is the amount of time it took for me to figure out how and when to specify fixed vs. inferred parameter values in my models. Thanks!

AllenDowney commented 2 years ago

Hi Greg. Thanks for this suggestion. Based on my mental model of how PyMC works, I am not convinced yet that these changes will make a difference, but I might be mistaken. Have you run a minimal example with and without this change, and are the results actually different?

gbrunkhorst commented 2 years ago

Allen - I worked this example further, got a tip from Christian Luhmann on Discourse, and my thoughts are now laid out in this notebook. I think you will be able to see what I'm thinking by scanning the visuals. I ran my analysis on a simplified model, but following the pattern, I think that this code:

with pm.Model() as model:
    yeast_conc = pm.Normal("yeast conc", 
                           mu=2 * billion, sd=0.4 * billion)

    shaker1_vol = pm.Normal("shaker1 vol", 
                               mu=9.0, sd=0.05)
    shaker2_vol = pm.Normal("shaker2 vol", 
                               mu=9.0, sd=0.05)
    shaker3_vol = pm.Normal("shaker3 vol", 
                               mu=9.0, sd=0.05)

should be:

import pymc3 as pm
import theano.tensor as tt

billion = 1e9  # scale constant, as defined earlier in the chapter
rand_stream = tt.random.utils.RandomStream()

with pm.Model() as model:
    yeast_conc = pm.Normal("yeast conc", 
                           mu=2 * billion, sd=0.4 * billion)

    shaker1_vol = pm.Deterministic("shaker1 vol", 
                                rand_stream.normal(9.0, 0.05))
    shaker2_vol = pm.Deterministic("shaker2 vol", 
                                rand_stream.normal(9.0, 0.05))
    shaker3_vol = pm.Deterministic("shaker3 vol", 
                                rand_stream.normal(9.0, 0.05))

if the intent is for the shaker volumes to be fixed distributions (i.e., regardless of the observed data). Let me know if I can help develop this idea further.
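The intended semantics can be illustrated with a small numpy sketch (toy numbers, not the chapter's actual model): under the Deterministic-plus-RandomStream scheme, every draw of a shaker volume comes straight from N(9.0, 0.05), so its sampled distribution is the same no matter how extreme the observed data are.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# An extreme observation; under the fixed-distribution scheme it plays
# no role in how the volumes are drawn.
y_obs = 25.0

# Every draw of the shaker volume comes straight from N(9.0, 0.05),
# independent of y_obs, so the sampled "posterior" is just the prior.
vol_fixed = rng.normal(9.0, 0.05, n)

print(vol_fixed.mean(), vol_fixed.std())
```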

Greg

AllenDowney commented 2 years ago

Hi Greg,

I think I see the distinction you are making: in the first case, the model specifies a prior distribution for shaker1_vol (for example) and generates samples from its posterior distribution. In the second case, the distribution of shaker1_vol is considered known and does not get updated based on the data.

If I'm interpreting that correctly, I think the first model is what I want to use. It might seem strange that the data should change our opinion about the volumes of the shakers, but it might help to think of it this way: "Before seeing the data, I knew that the actual volume in the shakers could vary. After seeing the data, I have some information about where in the range of variation the volumes happened to fall during this particular experiment." For example, in a sample where yeast_conc is actually low, and the measurements are relatively high, I would suspect that the volumes were on the high side.
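The updating described here can be demonstrated with a small importance-sampling sketch in numpy (toy numbers, not the chapter's model): prior draws of concentration and volume are reweighted by the likelihood of a measurement on the high side, and the volume's posterior mean lands above its prior mean, whereas an unweighted (fixed-distribution) summary would stay at 9.0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

conc = rng.normal(2.0, 0.1, n)   # toy concentration prior
vol = rng.normal(9.0, 0.1, n)    # toy shaker-volume prior
sigma_y = 0.5                    # toy measurement noise
y_obs = 19.5                     # observed above the prior mean of conc * vol (18)

# Weight each prior draw by the likelihood p(y_obs | conc, vol).
log_w = -0.5 * ((y_obs - conc * vol) / sigma_y) ** 2
w = np.exp(log_w - log_w.max())
w /= w.sum()

post_vol_mean = np.sum(w * vol)  # posterior mean under the "inferred" scheme
prior_vol_mean = vol.mean()      # what the "fixed" scheme would report

print(prior_vol_mean, post_vol_mean)  # the posterior mean sits above 9.0
```

The shift is small because the volume prior is tight, but it is real: the high measurement makes draws with larger volumes more likely, exactly as described above.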

Allen

gbrunkhorst commented 2 years ago

Yes, you see the distinction I'm trying to make! Thank you. I'll close this issue and leave my code on Discourse for posterity. In my work in environmental engineering, the second model is more common (e.g., the laboratory reports that the concentration is x, and their QA/QC program says they are accurate to +/- s). But I can see your point, and it depends on the scientist. The beauty of open science is that the assumptions are transparent!

AllenDowney commented 2 years ago

All good. Thanks!