josiahseaman opened this issue 9 years ago
I'm filling in SciPy functions for the various PDF types over on the wiki.
The first column (NAADSM XML) is only relevant for the xml2sqlite function. The last two columns are the ones that will be useful for this task.
Thanks! It looks like you're doing the hard work of this issue. Since you're hunting down the SciPy methods to get their parameters, I will be able to copy your work and add the Python context. It should be pretty straightforward on my end, which means I could get to some other issues sooner. Thank you so much for doing this.
You guys rock! I see there is some documentation I need to update at the bottom. I have the editing version of the pdf overview, but haven't attempted getting it on the wiki.
@josiahseaman the table of SciPy functions is filled in now. The fixed value, histogram, inverse Gaussian, Pearson V, and piecewise will have to be special cases since they're not offered in SciPy.
Thanks Neil. Could you walk me through an example of these expressions?
d = (min + 4*mode + max)/6
alpha = 6*((d - min)/(max - min))
beta = 6*((max - d)/(max - min))
scipy.stats.beta(
alpha,
beta,
loc=min,
scale=max-min
)
I see here you've given the formula for calculating d, but I don't see d in the parameters passed to scipy.stats.beta, and I don't see the parameter mentioned in the SciPy docs. For most of the rest of these it looks like I can just plug the values into the function.
The row I'm referring to is Beta-PERT, so maybe I'm looking at the wrong function. I am seeing other people online mention scipy.stats.beta.cdf(x, a, b, loc=0, scale=1) (the cumulative distribution function) in reference to Beta-PERT. Is this right? Is d = (min + 4*mode + max)/6 actually the x parameter?
'd' is just an intermediate value calculated from the 3 input parameters min, mode, and max. Calculating d makes the formulas for alpha and beta a bit shorter.
As an example if the input parameters are min=2, mode=5, max=6, then
d = (2 + 4*5 + 6)/6 = 4.66667
alpha = 6*((4.66667 - 2)/(6 - 2)) = 4
beta = 6*((6 - 4.66667)/(6 - 2)) = 2
And now you have all the values needed to instantiate a beta, so pdf = scipy.stats.beta(4, 2, loc=2, scale=4)
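Putting that walk-through together as a runnable check (one useful property to verify: the mean of a beta-PERT should land exactly on d):

```python
import scipy.stats

# Beta-PERT inputs from the example above
mn, mode, mx = 2.0, 5.0, 6.0

d = (mn + 4 * mode + mx) / 6            # 4.66667
alpha = 6 * ((d - mn) / (mx - mn))      # 4.0
beta = 6 * ((mx - d) / (mx - mn))       # 2.0

pdf = scipy.stats.beta(alpha, beta, loc=mn, scale=mx - mn)

# Sanity checks: the support is [min, max] and the mean comes back as d
print(pdf.mean())                        # 4.66667, same as d
print(pdf.pdf(mn), pdf.pdf(mx))          # density is 0 at both endpoints
```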
Oh, I'm sorry I didn't notice it was used later in the formula. So does this precalculation mean that it's okay that Beta and Beta-PERT are essentially calling the same SciPy method?
Yep, it's OK! A beta-pert is just a beta distribution. But the usual parameters to a beta distribution (alpha and beta) are a bit unintuitive, so people came up with a way to specify more intuitive parameters (min, mode, max) and transform those parameters to alpha and beta.
I came to different conclusions than you about triangular based on this doc:
"The standard form is in the range [0, 1] with c the mode. The location parameter shifts the start to loc. The scale parameter changes the width from 1 to scale."
So I wrote: [scipy.stats.triang, {'loc':m.min, 'c':m.mode, 'scale': m.max}]
where m is the database model. Can you double check this and say which one of us is correct and why?
I think it's me. :) Try this:
import scipy.stats
minn=1.0
mode=3.0
maxx=4.0
dist = scipy.stats.triang((mode-minn)/(maxx-minn),loc=minn,scale=maxx-minn)
[dist.pdf(x) for x in (0,1,1.5,2,2.5,3,3.5,4,5)]
You'll get 0 before and at the minimum (at x=0 and x=1), 0 at and after the maximum (x=4 and x=5), and the highest value at the mode, x=3.
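A quick numeric version of that check (the extra fact used here: a triangular density peaks at 2/width, since the triangle's area must be 1):

```python
import scipy.stats

minn, mode, maxx = 1.0, 3.0, 4.0
dist = scipy.stats.triang((mode - minn) / (maxx - minn), loc=minn, scale=maxx - minn)

print(dist.pdf(minn))   # 0 at the minimum
print(dist.pdf(mode))   # peak: 2 / (max - min) = 2/3
print(dist.pdf(maxx))   # 0 at the maximum
```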
Thanks, I think I'll do testing for each in IPython Notebook so I can double check the parameter names for everything. The way I set it up, I'm not using positional arguments, if that turns into a problem, I can always change it. Here's what I have so far:
d = (m.min + 4 * m.mode + m.max) / 6
# log defaults to ln
eq = { # Compiled from: https://github.com/NAVADMC/ADSM/wiki/Probability-density-functions Thanks Neil Harvey!
"Bernoulli": [scipy.stats.bernoulli, {'p': m.p}],
"Beta": [scipy.stats.beta, {'a': m.alpha, 'b': m.alpha2, 'loc': m.min, 'scale': m.max - m.min}], # TODO: can't confirm param names
"BetaPERT": [scipy.stats.beta,
{'a': 6 * ((d - m.min) / (m.max - m.min)), 'b': 6 * ((m.max - d) / (m.max - m.min)), 'loc': m.min, 'scale': m.max - m.min}],
"Binomial": [scipy.stats.binom, {'n': m.s, 'p': m.p}],  # binom takes n (trials) and p (success probability), not 's'/'loc'
"Discrete Uniform": [scipy.stats.randint, {'low': m.min, 'high': m.max + 1}],  # randint's keywords are low/high; high is exclusive
"Exponential": [scipy.stats.expon, {'scale': m.mean}],
"Fixed Value": [], # TODO: Not offered in SciPy? Use a narrow uniform instead
"Gamma": [scipy.stats.gamma, {'a': m.alpha, 'scale': m.beta, 'loc': 0}],
"Gaussian": [scipy.stats.norm, {'loc': m.mean, 'scale': m.std_dev}],
"Histogram": [], # TODO: Not offered in SciPy? by hand
"Hypergeometric": [scipy.stats.hypergeom, {'M': m.m, 'n': m.d, 'N': m.n}],
# I don't know, but these seem to be the order of args in scipy/stats/_discrete_distns.py:306
"Inverse Gaussian": [], # TODO: Shape/lambda parameter not supported in SciPy?
"Logistic": [scipy.stats.logistic, {'loc': m.location, 'scale': m.scale}],
"LogLogistic": [scipy.stats.fisk, {'c':m.shape, 'loc': m.location, 'scale': m.scale}], # scipy/stats/_continuous_distns.py:683 http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisk.html
"Lognormal": [scipy.stats.lognorm, {'s': sqrt(log((m.std_dev**2 + m.mean**2) / m.mean**2)),  # lognorm's shape keyword is 's'
                                    'scale': m.mean**2 / sqrt(m.std_dev**2 + m.mean**2)}],  # the exp(log()) was redundant, removed
# Double check this, there's a lot of math
"Negative Binomial": [scipy.stats.nbinom, {'n': m.s, 'p': m.p}],
"Pareto": [scipy.stats.pareto, {'b': m.theta, 'scale': m.a}],  # pareto's shape keyword is 'b'
"Pearson 5": [], # TODO: Not offered in SciPy
"Piecewise": [],
"Poisson": [scipy.stats.poisson, {'mu': m.mean}],  # poisson's keyword is 'mu', not 'k'
"Triangular": [scipy.stats.triang, {'loc':m.min, 'c':(m.mode-m.min)/(m.max-m.min), 'scale': m.max-m.min}],
"Uniform": [scipy.stats.uniform, {'loc': m.min, 'scale':m.max-m.min}],
"Weibull": [scipy.stats.weibull_min, {'c': m.alpha, 'scale': m.beta}],  # weibull_min's shape keyword is 'c'
}
function = eq[kind][0]
kwargs_dict = eq[kind][1]
return pdf_graph(m.x_axis_units, function, kwargs_dict)
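The Lognormal row is the trickiest conversion in that table, so here is a standalone sanity check of the mean/std_dev transform (a sketch; SciPy's shape keyword for lognorm is `s`, and the transform should round-trip back to the inputs):

```python
from math import sqrt, log
import scipy.stats

mean, std_dev = 5.0, 2.0

# NAADSM gives (mean, std_dev); scipy.stats.lognorm wants (s, scale)
s = sqrt(log((std_dev**2 + mean**2) / mean**2))
scale = mean**2 / sqrt(std_dev**2 + mean**2)   # exp(log(...)) removed, it was redundant

dist = scipy.stats.lognorm(s, scale=scale)
print(dist.mean(), dist.std())                 # should recover 5.0 and 2.0
```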
I think I'm going to have to do something funny to prevent expressions from evaluating when they're referencing fields from other equation types. It's coming along faster than I had hoped. Thanks for all your help.
I'm getting there :)
I find something about the little graphs really cute.
yeah! Almost half way!
@ndh2 Is this right? All I'm giving it is a "p" of 0.5. I'm calling the Probability Mass Function, instead of the density function. So I'm not sure if this is a valid representation for the user.
I'm thinking this is the correct output since Bernoulli is just choosing between 1 and 0. So at p=0.9 it's 1 nine tenths of the time. It just looked odd at 0.5.
I'm making up input variables, so I hope these are all reasonable and within bounds for tests:
m = empty()
m.mean = 0.5
m.min=1.0
m.mode=3.0
m.max=4.0
m.std_dev = 0.5
m.alpha = 0.5
m.alpha2 = 0.5
m.beta = 0.5
m.location = 0.5
m.scale = 0.5
m.shape = 0.5
m.n = 2
m.p = 0.9
m.m = 3
m.d = 5
m.theta = 0.5
m.a = 0.5
m.s = 7
The scale is off on Fixed Value. It'll say p = 10,000.0 if I set the scale to .0001. Is that a problem?
End of the day: 19 out of 23 graphs working. 2 of those are relational graphs, which are doable. I don't know what to do with Inverse Gaussian and Pearson 5 yet.
The graphing routines you're using, do they have an option for stair-step type graphs? Those might look better for the discrete ones (binomial, discrete uniform, Poisson, hypergeometric, Bernoulli, negative binomial). Like this example (negative binomial):
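One way to get that stair-step look (a sketch: evaluate the pmf only on the integer support with SciPy, then hand the pairs to matplotlib's `plt.step(xs, ys, where='mid')`; the plotting call itself is assumed, only the pmf part is shown here):

```python
import numpy as np
import scipy.stats

# Negative binomial, like the example mentioned above (parameters made up)
dist = scipy.stats.nbinom(5, 0.5)

# Evaluate the pmf on integers only; a dense linspace would miss the spikes
xs = np.arange(0, int(dist.ppf(0.999)) + 1)
ys = dist.pmf(xs)

# (xs, ys) is what you'd feed to plt.step(xs, ys, where='mid')
print(ys.sum())   # covers at least 99.9% of the probability mass
```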
Fixed value is weird because the real way to display it would be an infinitely slim, infinitely high vertical line. Any way you try to draw that is going to come out weird. I think the old NAADSM didn't even bother to display fixed values in a graph form.
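That also explains the p = 10,000 observation above: a pdf reports density, not probability, and a uniform of width w has density 1/w everywhere on its support, so a narrow-uniform stand-in for a fixed value will always show huge y values. A tiny sketch:

```python
import scipy.stats

value, width = 5.0, 0.0001
spike = scipy.stats.uniform(loc=value, scale=width)

print(spike.pdf(value))          # 1/width = 10000.0 -- a density, not a probability
print(spike.cdf(value + width))  # the total probability is still 1.0
```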
Good idea. I will start in on the stepwise graphs once I have all graphs working at least a little. Any ideas on the Pearson 5 or Inverse Gaussian? I'm going to start researching those today.
From the docs: rv = invgauss(mu, loc=0, scale=1) returns a "Frozen RV object with the same methods but holding the given shape, location, and scale fixed."
This is what I get if shape is given to loc, which looks correct: [plot]
If shape is assigned to scale I get: [plot]
@ndh2 I'm having more trouble with terminology and variable names here: https://en.wikipedia.org/wiki/Pearson_distribution#The_Pearson_type_V_distribution
The Pearson 5 distribution (figure 0-17) is defined by its shape α and scale β, both of which must be greater than 0. A Pearson 5 distribution has a mean of β/(α−1). The probability density function is f(x) = exp(−β/x) / (β Γ(α) (x/β)^(α+1)), where Γ is the Gamma function.
SciPy has an inverse gamma distribution. But I can't map the variable names from NAADSM -> Wikipedia -> SciPy without really understanding the equations well. In particular, Wikipedia is referencing a, b1, b2, but I only have alpha and beta, so I might be missing one.
For the inverse Gaussian: SciPy's loc parameter shifts the curve left or right, which ADSM doesn't support, so we can ignore loc.
Here's the inverse Gaussian PDF formula copied from Wikipedia:
f(x; μ, λ) = sqrt(λ / (2π x³)) · exp(−λ(x − μ)² / (2μ² x))
And here's the same thing in prob_dist.c in ADSM for verification:
sqrt (dist->lambda / (2.0 * M_PI * gsl_pow_3 (x))) *
exp (- ((dist->lambda * gsl_pow_2 (x - dist->mu)) /
(2.0 * gsl_pow_2 (dist->mu) * x)))
To check whether λ in this formula can be plugged into SciPy as the scale parameter, I plotted the formula above with μ=1, λ=3 and also plotted the line I get from scipy.stats.invgauss(1.0, scale=3). They don't match.
So to plot the inverse Gaussian, I think we have to take that formula I wrote out in C code above and use that in the GUI.
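For reference, here is that C formula transcribed to Python (a sketch). As a cross-check, it appears to agree with SciPy's invgauss when the shape argument is μ/λ (not μ) with scale=λ, which would explain why invgauss(1.0, scale=3) didn't match:

```python
from math import sqrt, exp, pi
import scipy.stats

def invgauss_pdf(x, mu, lam):
    # Direct transcription of the prob_dist.c formula above
    return (sqrt(lam / (2.0 * pi * x**3))
            * exp(-(lam * (x - mu)**2) / (2.0 * mu**2 * x)))

mu, lam = 1.0, 3.0
# Note the shape argument: mu/lam, not mu
dist = scipy.stats.invgauss(mu / lam, scale=lam)

for x in (0.5, 1.0, 2.0):
    print(invgauss_pdf(x, mu, lam), dist.pdf(x))   # the two columns agree
```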
For the Pearson 5, we had problems with overflows in the PDF calculation. We ended up using a trick where most of the formula is calculated in log space. So doing that function is going to require finding a SciPy equivalent of GSL's gsl_sf_lngamma function. Lemme check into it.
In the C code, what is the meaning of dist->lambda? I think this has a different meaning in Python. In Python, lambda is an anonymous function declaration followed by an argument list and return value. I can't tell where your anonymous function scope ends in the code:
sqrt (dist->lambda / (2.0 * M_PI * gsl_pow_3 (x))) *
exp (- ((dist->lambda * gsl_pow_2 (x - dist->mu)) /
(2.0 * gsl_pow_2 (dist->mu) * x)))
Oh derp, I way overthought that question. dist is an object pointer that contains a variable called lambda, and neither of those words has keyword significance, do they?
Voila! This is using mean = 1 and shape = 3, just like yours.
If it's alright with you, I'd like to just plug in the math from C for the Pearson V as well. It seems simple enough; the main cost is a tiny performance hit, but I'm not graphing these at a terribly high x resolution.
I did notice in both your graph and mine that the peak goes all the way up to 1.0. I think this means it's not properly normalized. I assume your normalization step is somewhere else in the C code.
I'll post the math from C -- just need to find a Python equivalent of a function from GSL I was using.
A peak that goes up to 1.0 (or higher) is OK. (See examples on the inverse Gaussian Wikipedia page).
Formula for Pearson 5:
from scipy.special import gammaln
from math import exp,log
def p5_pdf(x, a, b):
    return exp(-b / x - gammaln(a) - (a + 1) * log(x / b)) / b
I tried graphing this formula with a few (a,b) pairs and I got lines that perfectly imitate the plot on the Vose site, so that tells me we're on the right track.
Good news: the images I've been creating do indeed play nice with the UI and embed correctly:
Bad news: now I need to figure out how to find reasonable x ranges on infinite functions...
Maybe the .ppf method with values of .05 and 0.95?
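That suggestion as a sketch (ppf is the inverse CDF, so these quantiles bracket the bulk of the probability mass; the 0.05/0.95 cutoffs are just the values suggested above, not anything ADSM mandates):

```python
import numpy as np
import scipy.stats

dist = scipy.stats.expon(scale=2.0)   # an example distribution with infinite support

# Clip the infinite x range to where the probability mass actually lives
x_min = dist.ppf(0.05)
x_max = dist.ppf(0.95)
xs = np.linspace(x_min, x_max, 100)
ys = dist.pdf(xs)

print(x_min, x_max)   # roughly 0.103 to 5.99 for this exponential
```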
Thanks for the tip. That worked except that I lost my discrete function peaks now that the intervals aren't landing on whole numbers. I was going to fix that anyways.
It looks like some of these (Beta, Weibull) are asymptotic. I tried increasing the sample resolution to make the two Beta peaks even and instead it shot the peaks up to 35.
Histogram isn't perfect. Please remind me what the requirements for Histogram points were? I was thinking I might be able to build the rules into the graphing function, rather than requiring the user to do it.
The requirements are:
I've merged in the i331 branch as requested by @josiahseaman. However, the new scipy dependency is causing issues at compile time.
It crashes on setting dependency_links and states there is no scipy module (though it is confirmed to be installed through the Python Console).
@josiahseaman We cannot use the latest version of scipy (0.16.0) as it does not support cx_freeze.
We need to find an installation file for the previous version (0.15.1) which does support cx_freeze. The scipy sourceforge page only has 32bit install files. And I cannot for the life of me figure out if there is a historical repo at http://www.lfd.uci.edu/~gohlke/pythonlibs/ as the download links are obscured.
I've found a new repository of wheel files which had the older version of scipy that we needed. https://nipy.bic.berkeley.edu/pna/wheels/
@josiahseaman I found that SciPy's "invgamma" function is the same as our Pearson V. Use scipy.stats.invgamma(alpha, scale=beta). One less special case needed.
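A quick check that the log-space p5_pdf from earlier and scipy.stats.invgamma really are the same curve:

```python
from math import exp, log
from scipy.special import gammaln
import scipy.stats

def p5_pdf(x, a, b):
    # Pearson 5 pdf computed in log space to avoid overflow (from earlier in the thread)
    return exp(-b / x - gammaln(a) - (a + 1) * log(x / b)) / b

a, b = 4.0, 2.0
dist = scipy.stats.invgamma(a, scale=b)

for x in (0.5, 1.0, 3.0):
    print(p5_pdf(x, a, b), dist.pdf(x))   # the two columns agree
```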
Thanks Neil. It looks like Pearson 5 is working now. I've wrapped a check for divide by zero errors for all graphs. I've been checking the output graphs against real scenarios with actual values and they look good. Would you help me with verifying that this is actually working as intended?
@ndh2 this is ready in master, in your dev environment. But I can't compile a release update at the moment, so if you want to test it, you'll have to git pull the dev version.
I'm going to double check histogram behavior. Graphs are in a workable state, but discrete functions are still a little odd looking because they're infinitely thin sticks rather than a smooth distribution, and sometimes the sampling doesn't match them. I'm going to get Priority Widgets done, then come back to this.
Would you help me with verifying that this is actually working as intended?
I have a bunch of example plots for all of the PDF types. I'll go through each and make sure I see the same curve when I plug the numbers into ADSM.
Continuous
Discrete
Gaussian at 0
Thanks. I spotted another bug around required fields at the same time. If you have form submission suddenly not work on "Beta" "Gaussian" "Binomial" or "Uniform", that's the bug I just found and fixed.
Hi Neil, I'll be working on the discrete sampling problem today. I'm letting you know since it looks like you're also working on this issue. Thanks again for your help in verifying this. I'm willing to do verification work as well if you would like to send me your graph examples.
I was just looking at the continuous distributions, not the discrete ones.
There are a bunch of great graph examples for the various discrete types here: Vose ModelRisk online docs
@ndh2 I've got a basic solution for Discrete functions that is plotting correctly. Now I'm working on finding a way of calling them that works for both.
There were some casualties that I'll revisit: "Beta", "Weibull", "Pareto", "Pearson 5".
Probably more difficult because we need to accommodate for each equation type.
330 again.
I think this issue requires me to match every equation type option to a function in matplotlib. If that's the case, it'll take me a while compared to Relational Functions