NAVADMC / ADSM

A simulation of disease spread in livestock populations. Includes detection and containment simulation.
Other
10 stars 5 forks source link

Graphing the Probability Density Functions #331

Open josiahseaman opened 9 years ago

josiahseaman commented 9 years ago

Probably more difficult because we need to accommodate for each equation type.

330 again.

I think this issue requires me to match every equation type option to a function in matplotlib. If that's the case, it'll take me a while compared to Relational Functions

josiahseaman commented 9 years ago

http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.fisk.html

ndh2 commented 9 years ago

I'm filling in SciPy functions for the various PDF types over on the wiki.

The first column (NAADSM XML) is only relevant for the xml2sqlite function. The last two columns are the ones that will be useful for this task.

josiahseaman commented 9 years ago

Thanks! It looks like you're doing the hard work of this issue. Since you're hunting down the SciPy methods in order to get their parameters I will be able to copy your work and add the python context. It should be pretty straightforward on my end. Which means I could get to some other issues in time. Thank you so much for doing this.

missyschoenbaum commented 9 years ago

You guys rock! I see there is some documentation I need to update at the bottom. I have the editing version of the pdf overview, but haven't attempted getting it on the wiki.

ndh2 commented 9 years ago

@josiahseaman the table of SciPy functions is filled in now. The fixed value, histogram, inverse Gaussian, Pearson V, and piecewise will have to be special cases since they're not offered in SciPy.

josiahseaman commented 9 years ago

Thanks Neil. Could you walk me through an example of these expressions?

d = (min + 4*mode + max)/6

alpha = 6*((d - min)/(max - min))

beta = 6*((max - d)/(max - min))

scipy.stats.beta(
 alpha,
 beta,
 loc=min,
 scale=max-min
)

I see here you've given the formula for calculating d but I don't see it in the parameters passed to scipy.stats.beta. I don't see the parameter mentioned in Scipy docs. Most of the rest of these it looks like I can just plugin the values to the function.

josiahseaman commented 9 years ago

The row I'm referring to is Beta-Pert, so maybe I'm looking at the wrong function. I am seeing other people online mention scipy.stats.beta.cdf(x, a, b, loc=0, scale=1) (Cumulative density function) in reference to Beta-Pert. Is this right? Is d = (min + 4*mode + max)/6 actually x parameter?

ndh2 commented 9 years ago

'd' is just an intermediate value calculated from the 3 input parameters min, mode, and max. Calculating d makes the formulas for alpha and beta a bit shorter.

ndh2 commented 9 years ago

As an example if the input parameters are min=2, mode=5, max=6, then

d = (2+45+6)/6 = 4.66667 alpha = 6((4.66667-2)/(6-2)) = 4 beta = 6*((6-4.66667)/(6-2)) = 2

And now you have all the values needed to instantiate a beta, so pdf = scipy.stats.beta(4,2,2,4)

josiahseaman commented 9 years ago

Oh, I'm sorry I didn't notice it was used later in the formula. So does this precalculation mean that's it's okay that Beta and Beta-Pert are essentially calling the same Scipy method?

ndh2 commented 9 years ago

Yep, it's OK! A beta-pert is just a beta distribution. But the usual parameters to a beta distribution (alpha and beta) are a bit unintuitive, so people came up with a way to specify more intuitive parameters (min, mode, max) and transform those parameters to alpha and beta.

josiahseaman commented 9 years ago

I came to different conclusions than you about triangular based on this doc:

"The standard form is in the range [0, 1] with c the mode. The location parameter shifts the start to loc. The scale parameter changes the width from 1 to scale."

So I wrote: [scipy.stats.triang, {'loc':m.min, 'c':m.mode, 'scale': m.max}] where m is the database model. Can you double check this and say which one of us is correct and why?

ndh2 commented 9 years ago

I think it's me. :) Try this:

import scipy.stats
minn=1.0
mode=3.0
maxx=4.0
dist = scipy.stats.triang((mode-minn)/(maxx-minn),loc=minn,scale=maxx-minn)
[dist.pdf(x) for x in (0,1,1.5,2,2.5,3,3.5,4,5)]

You'll get 0 before and at the minimum (at x=0 and x=1), 0 at and after the maximum (x=4 and x=5), and the highest value at the mode, x=3.

josiahseaman commented 9 years ago

Thanks, I think I'll do testing for each in IPython Notebook so I can double check the parameter names for everything. The way I set it up, I'm not using positional arguments, if that turns into a problem, I can always change it. Here's what I have so far:

    d = (m.min + 4 * m.mode + m.max) / 6
    # log defaults to ln
    eq = {  # Compiled from:  https://github.com/NAVADMC/ADSM/wiki/Probability-density-functions  Thanks Neil Harvey!
            "Bernoulli": [scipy.stats.bernoulli, {'p': m.p}],
            "Beta": [scipy.stats.beta, {'a': m.alpha, 'b': m.alpha2, 'loc': m.min, 'scale': m.max - m.min}],  # TODO: can't confirm param names
            "BetaPERT": [scipy.stats.beta,
                         {'a': 6 * ((d - m.min) / (m.max - m.min)), 'b': 6 * ((m.max - d) / (m.max - m.min)), 'loc': m.min, 'scale': m.max - m.min}],
            "Binomial": [scipy.stats.binom, {'s': m.s, 'loc': m.p}],
            "Discrete Uniform": [scipy.stats.randint, {'a': m.min, 'b': m.max}],
            "Exponential": [scipy.stats.expon, {'scale': m.mean}],
            "Fixed Value": [],  # TODO: Not offered in SciPy? Use a narrow uniform instead
            "Gamma": [scipy.stats.gamma, {'a': m.alpha, 'scale': m.beta, 'loc': 0}],
            "Gaussian": [scipy.stats.norm, {'loc': m.mean, 'scale': m.std_dev}],
            "Histogram": [],  # TODO: Not offered in SciPy? by hand
            "Hypergeometric": [scipy.stats.hypergeom, {'M': m.m, 'n': m.d, 'N': m.n}],
            # I don't know, but these seem to be the order of args in scipy/stats/_discrete_distns.py:306
            "Inverse Gaussian": [],  # TODO: Shape/lambda parameter not supported in SciPy?
            "Logistic": [scipy.stats.logistic, {'loc': m.location, 'scale': m.scale}],
            "LogLogistic": [scipy.stats.fisk, {'c':m.shape, 'loc': m.location, 'scale': m.scale}],  # scipy/stats/_continuous_distns.py:683 http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisk.html
            "Lognormal": [scipy.stats.lognorm, {'sigma': (sqrt(log((std_dev**2 + mean**2) / mean**2))), 
                                                'scale': exp(log(mean**2 / sqrt(std_dev**2 + mean**2)))}],  # I think exp(log()) is redundant
                                                # Double check this, there's a lot of math
            "Negative Binomial": [scipy.stats.nbinom, {'n': m.s, 'p': m.p}],
            "Pareto": [scipy.stats.pareto, {'theta': m.theta, 'scale': m.a}],
            "Pearson 5": [],  # TODO: Not offered in SciPy
            "Piecewise": [],
            "Poisson": [scipy.stats.poisson, {'k': m.mean}],
            "Triangular": [scipy.stats.triang, {'loc':m.min, 'c':(m.mode-m.min)/(m.max-m.min), 'scale': m.max-m.min}],
            "Uniform": [scipy.stats.uniform, {'loc': m.min, 'scale':m.max-m.min}],
            "Weibull": [scipy.stats.weibull_min, {'x':m.alpha, 'scale': m.beta}],
            }
    function = eq[kind][0]
    kwargs_dict = eq[kind][1]

    return pdf_graph(m.x_axis_units, function, kwargs_dict)

I think I'm going to have to do something funny to prevent expressions from evaluating when they're referencing fields from other equation types. It's coming along faster than I had hoped. Thanks for all your help.

josiahseaman commented 9 years ago

I'm getting there :)

image

I find something about the little graphs really cute.

missyschoenbaum commented 9 years ago

yeah! Almost half way!

josiahseaman commented 9 years ago

image

@ndh2 Is this right? All I'm giving it is a "p" of 0.5. I'm calling the Probability Mass Function, instead of the density function. So I'm not sure if this is a valid representation for the user.

josiahseaman commented 9 years ago

image

I'm thinking this is the correct output since Bernouli is just chosing between 1 and 0. So at 0.9 it's 1 nine tenths of the time. It just looked odd at 0.5.

I'm making up input variables, so I hope these are all reasonable and within bounds for tests:

m = empty()
m.mean = 0.5
m.min=1.0
m.mean
m.mode=3.0
m.max=4.0
m.std_dev = 0.5
m.alpha = 0.5
m.alpha2 = 0.5
m.beta = 0.5
m.location = 0.5
m.scale = 0.5
m.shape = 0.5
m.n = 2
m.p = 0.9
m.m = 3
m.d = 5
m.theta = 0.5
m.a = 0.5
m.s = 7
josiahseaman commented 9 years ago

image

josiahseaman commented 9 years ago

The scale is off on Fixed Value. It'll say p = 10,000.0 if I set the scale to .0001. Is that a problem?

End of the day: 19 out of 23 graphs working. 2 of those are relational graphs, which are doable. I don't know what to do with Inverse Gaussian and Pearson 5 yet.

ndh2 commented 9 years ago

The graphing routines you're using, do they have an option for stair-step type graphs? Those might look better for the discrete ones (binomial, discrete uniform, Poisson, hypergeometric, Bernoulli, negative binomial). Like this example (negative binomial):

negbinomial

Fixed value is weird because the real way to display it would be an infinitely slim, infinitely high vertical line. Any way you try to draw that is going to come out weird. I think the old NAADSM didn't even bother to display fixed values in a graph form.

josiahseaman commented 9 years ago

Good idea. I will start in on the stepwise graphs once I have all graphs working at least a little. Any ideas on the Pearson 5 or Inverse Gaussian? I'm going to start researching those today.

josiahseaman commented 9 years ago

Inverse Gaussian

rv = invgauss(mu, loc=0, scale=1) Frozen RV object with the same methods but holding the given shape, location, and scale fixed.

This is what I get if shape is given to loc, which looks correct: image From Docs: image

If shape is assigned to scale I get: image

josiahseaman commented 9 years ago

Pearson V Distribution

@ndh2 I'm having more trouble with terminology and variable names here: https://en.wikipedia.org/wiki/Pearson_distribution#The_Pearson_type_V_distribution

NAADSM Docs:

The Pearson 5 distribution (figure 0-17) is defined by its shape α and scale β, both of which must be greater than 0. A Pearson 5 distribution has a mean of β/(α-1). The probability density function f(x) is calculated as: where Γ is the Gamma function. image

Scipy has an inverse gamma distribution. But I can't map the variable names from NAADSM -> Wikipedia -> Scipy without really understanding the equations well. In particular, Wikipedia is referencing a, b1, b2 but I only have alpha and beta, so I might be missing one.
image image image

ndh2 commented 9 years ago

For the inverse Gaussian: SciPy's loc parameter shifts the curve left or right, which ADSM doesn't support, so we can ignore loc.

Here's the inverse Gaussian PDF formula copied from Wikipedia:

inv_gauss_formula

And here's the same thing in prob_dist.c in ADSM for verification:

sqrt (dist->lambda / (2.0 * M_PI * gsl_pow_3 (x))) *
 exp (- ((dist->lambda * gsl_pow_2 (x - dist->mu)) /
         (2.0 * gsl_pow_2 (dist->mu) * x)))

To check whether λ in this formula can be plugged into SciPy as the scale parameter, I plotted the formula above with μ=1,λ=3 and also plotted the line I get from scipy.stats.invgauss(1.0,scale=3). They don't match.

inverse_gaussian_demo_for_josiah

So to plot the inverse Gaussian, I think we have to take that formula I wrote out in C code above and use that in the GUI.

ndh2 commented 9 years ago

For the Pearson 5, we had problems with overflows in the PDF calculation. We ended up using a trick where most of the formula is calculated in log space. So doing that function is going to require finding a SciPy equivalent of GSL's gsl_sf_lngamma function. Lemme check into it.

josiahseaman commented 9 years ago

In the C code, what is the meaning of dist->lambda. I think this has a different meaning in Python. In Python, lambda is an annonymous function declaration followed by an argument list and return value. I can't tell where your annonymous function scope ends in the code:

sqrt (dist->lambda / (2.0 * M_PI * gsl_pow_3 (x))) *
 exp (- ((dist->lambda * gsl_pow_2 (x - dist->mu)) /
         (2.0 * gsl_pow_2 (dist->mu) * x)))
josiahseaman commented 9 years ago

Oh derp, I way over thought that question. dist is an object pointer that contains a variable called lambda and neither of those words have keyword significance, do they?

josiahseaman commented 9 years ago

Voila! This is using mean = 1 and shape = 3, just like yours.

image

If it's alright with you, I'd like to just plugin the math from C for the Pearson V as well. It seems simple enough, the main cost is a tiny performance hit, but I'm not graphing these a terribly high x resolution.

josiahseaman commented 9 years ago

I did notice in both your graph and mine that the peak goes all the way up to 1.0. I think this means it's not properly normalized. I assume your normalization step is somewhere else in the C code.

ndh2 commented 9 years ago

I'll post the math from C -- just need to find a Python equivalent of a function from GSL I was using.

A peak that goes up to 1.0 (or higher) is OK. (See examples on the inverse Gaussian Wikipedia page).

ndh2 commented 9 years ago

Formula for Pearson 5:

from scipy.special import gammaln
from math import exp,log

def p5_pdf(x,a,b):
    return exp (-b / x - gammaln(a) - (a + 1) * log (x / b)) / b

I tried graphing this formula with a few (a,b) pairs and I got lines that perfectly imitate the plot on the Vose site, so that tells me we're on the right track.

pearson5_for_josiah

josiahseaman commented 9 years ago

Good news: the images I've been creating do indeed play nice with the UI and embed correctly: image

Bad news: now I need to figure out how to find reasonable x ranges on infinite functions...

ndh2 commented 9 years ago

Maybe the .ppf method with values of .05 and 0.95?

josiahseaman commented 9 years ago

Thanks for the tip. That worked except that I lost my discrete function peaks now that the intervals aren't landing on whole numbers. I was going to fix that anyways.

image

josiahseaman commented 9 years ago

It looks like some of these (Beta, Weibull) are asymptotic. I tried increasing the sample resolution to make the two Beta peaks even and instead it shot the peaks up to 35.

image

josiahseaman commented 9 years ago

image

Histogram isn't perfect. Please remind me what the requirements for Histogram points were? I was thinking I might be able to build the rules into the graphing function, rather than requiring the user to do it.

ndh2 commented 9 years ago

The requirements are:

BryanHurst commented 9 years ago

I've merged in the i331 branch as requested by @josiahseaman However, the new scipy dependency is causing issues at compile time.

It crashes on setting dependency_links and states there is no scipy module (though it is confirmed to be installed through the Python Console).

BryanHurst commented 9 years ago

@josiahseaman We cannot use the latest version of scipy (0.16.0) as it does not support cx_freeze.

We need to find an installation file for the previous version (0.15.1) which does support cx_freeze. The scipy sourceforge page only has 32bit install files. And I cannot for the life of me figure out if there is a historical repo at http://www.lfd.uci.edu/~gohlke/pythonlibs/ as the download links are obscured.

BryanHurst commented 9 years ago

I've found a new repository of wheel file which had the older version of scipy that we needed. https://nipy.bic.berkeley.edu/pna/wheels/

ndh2 commented 9 years ago

@josiahseaman I found that SciPy's "invgamma" function is the same as our Pearson V. Use scipy.stats.invgamma(alpha,scale=beta). One less special case needed.

josiahseaman commented 9 years ago

Thanks Neil. It looks like Pearson 5 is working now. I've wrapped a check for divide by zero errors for all graphs. I've been checking the output graphs against real scenarios with actual values and they look good. Would you help me with verifying that this is actually working as intended?

image

josiahseaman commented 9 years ago

@ndh2 this is ready in master, in your dev environment. But I can't compile a release update at the moment, so if you want to test it, you'll have to git pull the dev version.

josiahseaman commented 9 years ago

I'm going to double check histogram behavior. Graphs are in a workable state, but discrete functions are still a little odd looking because they're infinitely thin sticks, rather than smooth distribution and sometimes the sampling doesn't match it. I'm going to get Priority Widgets then come back to this.

ndh2 commented 9 years ago

Would you help me with verifying that this is actually working as intended?

I have a bunch of example plots for all of the PDF types. I'll go through each and make sure I see the same curve when I plug the numbers into ADSM.

Continuous

Discrete

josiahseaman commented 9 years ago

Gaussian at 0

image

Thanks. I spotted another bug around required fields at the same time. If you have form submission suddenly not work on "Beta" "Gaussian" "Binomial" or "Uniform", that's the bug I just found and fixed.

josiahseaman commented 9 years ago

Hi Neil, I'll be working on the discrete sampling problem today. I'm letting you know since it looks like you're also working on this issue. Thanks again for your help in verifying this. I'm willing to do verification work as well if you would like to send me your graph examples.

ndh2 commented 9 years ago

I was just looking at the continuous distributions, not the discrete ones.

There are a bunch of great graph examples for the various discrete types here: Vose ModelRisk online docs

josiahseaman commented 9 years ago

@ndh2 I've got a basic solution for Discrete functions that is plotting correctly. Now I'm working on finding a way of calling them that works for both.

image

There were some casualties that I'll revisit: "Beta", "Weibull", "Pareto", "Pearson 5".