AmpersandTV / pymc3-hmm

Hidden Markov models in PyMC3

Test negative-binomial emissions/responses #12

Closed brandonwillard closed 4 years ago

brandonwillard commented 4 years ago

Let's see how well our HMM state-sequence and transition samplers work under negative-binomial emissions/responses when their distribution parameters are estimated using NUTS.

We should also fit the aforementioned model to data generated by an HMM with two non-zero Poisson components having rates that are relatively overlapping, and vice versa (i.e. fit data produced by a negative-binomial using a two-component Poisson mixture).
In this situation, we can—for instance—see how well the extra dispersion of the negative-binomial can account for data that might otherwise require another Poisson component.
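
For concreteness, here's a rough sketch of how data along these lines could be simulated (the transition probabilities and rates below are hypothetical placeholders, not the values used in any notebook):

import numpy as np

np.random.seed(123)

# Hypothetical 3-state HMM: a zero state plus two Poisson states with
# relatively overlapping rates.
trans_mat = np.array([[0.90, 0.05, 0.05],
                      [0.05, 0.90, 0.05],
                      [0.05, 0.05, 0.90]])
rates = np.array([0.0, 50.0, 60.0])

states = [0]
for _ in range(999):
    states.append(np.random.choice(3, p=trans_mat[states[-1]]))
states = np.array(states)

y_sim = np.random.poisson(rates[states])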

Note: These considerations are also related to #11, since the resulting posterior of a Poisson-gamma is negative-binomial.

fanshi118 commented 4 years ago

How do we pick parameters, corresponding to Poisson and Gamma distributions, to initialize negative-binomial emissions? Is there a go-to approach for this, or is it something we have to test out?

brandonwillard commented 4 years ago

You're referring to the negative-binomial parameters (i.e. mu and alpha according to PyMC3's parameterization), right?

What you need to do is assign those parameters some prior distributions (or make them fixed constants, but that's not what we're interested in testing). To start, the supports of the priors must generally correspond with the acceptable parameter values; in this case, mu, alpha > 0, so distributions over the positive reals are worth considering. For example, gamma and log-normal distributions have compatible supports for mu and alpha.
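
For example, a minimal sketch of such prior assignments (the names and hyperparameters below are placeholders, not recommendations):

import numpy as np
import pymc3 as pm

y_obs = np.random.poisson(10, size=100)  # placeholder data, just for illustration

with pm.Model():
    # Both priors have support on the positive reals, matching mu, alpha > 0.
    mu = pm.Gamma("mu", alpha=2.0, beta=0.1)
    alpha = pm.Lognormal("alpha", mu=0.0, sigma=1.0)
    y = pm.NegativeBinomial("y", mu=mu, alpha=alpha, observed=y_obs)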

As a default, people will often use a conjugate distribution. This can be helpful, because then the (conditional) posterior is known and can be checked. Under the total/rate parameterization of the negative-binomial, the rate (or probability) parameter has a beta conjugate prior, but I don't know of a conjugate prior for the total (or count) parameter. You'll have to do a little research/math to determine how that translates under PyMC3's parameterization (e.g. perhaps the beta conjugate prior translates to a gamma distribution).
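
For what it's worth, here's a small sketch of that beta conjugacy under the total/probability parameterization, with the total n held fixed (translating this into PyMC3's mu/alpha terms is the part that still needs working out):

import numpy as np
import scipy.stats as st

# scipy's nbinom pmf is proportional to p**n * (1 - p)**k, so a Beta(a, b)
# prior on p gives the conjugate update p | y ~ Beta(a + n * N, b + sum(y)).
n, a, b = 5, 2.0, 2.0
y = st.nbinom.rvs(n, 0.3, size=200, random_state=42)

p_posterior = st.beta(a + n * len(y), b + y.sum())
print("posterior mean of p:", p_posterior.mean())  # should be near the true 0.3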

Otherwise, one generally chooses a compatible prior that assigns some "a priori" acceptable probability to parameter values (by definition). So, for instance, if there's reason to believe that larger values of a parameter are more probable than smaller ones, the prior density should reflect that. Such considerations can be reflected through the choice of prior distribution and/or its parameter values. (Naturally, there are meaningful interactions between combinations of priors and likelihoods/observation distributions to consider, but let's keep things simple for now.)

In summary, just try out some compatible priors, and use parameters that don't unnecessarily concentrate the density on specific values.

fanshi118 commented 4 years ago

In this example I made, I used gamma and log-normal priors as parameters, respectively, for mu and alpha in negative-binomial. I kept the same parameters for gamma priors as used in the original Poisson example, while picking arbitrarily large parameters for log-normal.

Observing the results in the trace plot, the alpha parameters tend to concentrate around 0, which probably isn't due to the priors themselves. Thus, we may speculate that the likelihood dragged those priors down to 0 in the posterior, which would imply that the negative-binomials "collapse" into regular Poissons. The similar outputs in the result of the two models (last graph in both notebooks) also hint at this conclusion, yet we're steps away from actually reaching there.

Some of the next steps we need to take:

  1. first, confirm that these priors haven't simply forced the posteriors toward that "ideal" result
  2. do some math to show that these posterior values are in line with our expectations/"ideal" outcome

brandonwillard commented 4 years ago

In this example I made, I used gamma and log-normal priors as parameters, respectively, for mu and alpha in negative-binomial.

I think you meant to say that you used the gamma and log-normal as priors, and not that you used the priors as parameters.

Also, you should clarify that the simulated data was drawn from a 3-component Poisson HMM (i.e. one zero component and two Poissons with non-zero rates). This tells us that the data we're fitting should be indicating simple Poissons instead of our model's negative-binomials. Now, when we discuss the "collapsing" to Poissons, the results make a lot more sense.

Observing the results in the trace plot, the alpha parameters tend to concentrate around 0, which probably isn't due to the priors themselves.

Since, as you state in the preceding paragraph, the log-normal priors on the alpha_* parameters were given large values, the implication is that the log-normal means were large or—at least—not close to zero, so we can assume that the concentration around zero isn't due to a highly informative/restrictive prior, or simply a prior that has little relevance to the model + data.

Thus, we may speculate that the likelihood dragged those priors down to zero in the posterior, ...

This is an extremely informal statement, but, more importantly, it largely echoes the previous statement: i.e. that the posterior alpha_* results aren't simply an artifact of the priors.

...which would imply that the negative-binomials "collapse" into regular Poissons.

I don't think that implication is best described as an interaction between the prior and likelihood, but, when the alpha_* parameters go to zero, the situation sounds strikingly similar to the classic negative-binomial/Poisson limit relationship (and I'm pretty sure that's exactly what it is). By checking the parameterization used in PyMC3 and cross-referencing with those classic limit results (i.e. doing the math), we can confirm that near zero alpha_* parameters do indeed produce Poisson-like mixture components and, as a result, we can build a case for our posteriors having converged to the true data-generating model.
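
As a neutral way to "do the math" numerically, one could compare the two mass functions directly; the sketch below assumes that pm.NegativeBinomial(mu, alpha) corresponds to scipy's nbinom with n=alpha and p=alpha/(alpha + mu), which should make the direction of the limit explicit:

import numpy as np
import scipy.stats as st

mu = 50.0
x = np.arange(0, 2000)
pois_pmf = st.poisson.pmf(x, mu)

# Total-variation distance between NB(mu, alpha) and Poisson(mu) for a range
# of alpha values.
for alpha in [0.5, 5.0, 50.0, 500.0]:
    nb_pmf = st.nbinom.pmf(x, alpha, alpha / (alpha + mu))
    print("alpha = {}: TV distance = {:.4f}".format(
        alpha, 0.5 * np.abs(nb_pmf - pois_pmf).sum()))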

The similar outputs in the result of the two models (last graph in both notebooks) also hint at this conclusion, yet we're steps away from actually reaching there.

To be more precise, the posterior predictive output looks very much like the output from an equivalent Poisson model (i.e. the true data-generating process). This is a loose confirmation of the possible collapsing/convergence-to-the-true-model mentioned earlier. Although, from this vantage alone, it's difficult to tell exactly how or why these good, Poisson-looking results were achieved. Nevertheless, we have yet another confirmation of relatively good/expected results.

After checking the math, the product of this experiment could easily be another test that fits such a model and confirms that the estimated alpha_* posteriors are near zero.

Next, we should try fitting the negative-binomial model against data that's actually generated from a negative-binomial HMM, so we can make sure that the parameters can be correctly estimated by our samplers. Now, in this case, we need to properly consider the identification issues.
For the Poisson model, we used incremental rate parameters, which naturally ordered the rates and made them identifiable. In the case of negative-binomial parameters, I don't know if their parameterization allows for such additivity. Although an incremental parameter approach was used in your example notebook, the near-zero alpha_* values—and their implied reduction to Poissons—may have avoided the identification issue altogether, but that won't be the case in this experiment.
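
For reference, a rough sketch of the incremental-rate idea in the Poisson case (the names and priors are hypothetical, not the notebook's exact code):

import pymc3 as pm

with pm.Model():
    # Positive increments imply ordered rates: lambda_2 >= lambda_1 by construction.
    E_1_rv = pm.Gamma("E_1", alpha=2.0, beta=0.001)
    E_2_rv = pm.Gamma("E_2", alpha=2.0, beta=0.001)
    lambda_1 = pm.Deterministic("lambda_1", E_1_rv)
    lambda_2 = pm.Deterministic("lambda_2", E_1_rv + E_2_rv)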

fanshi118 commented 4 years ago

Just made a new gist here.

In this new example, I generated data from two Poissons and fitted the negative-binomial model to it. The mus used in the data generation were kept the same as in the previous example (5000 and 7000), so we wanted to pick alpha parameters that reproduce those values. I tried a few and found that, when alpha is kept one-dimensional, we can get E[Y_t | S_t = 1] = 5000 and E[Y_t | S_t = 2] = 7000, so I just used [2] in the example.

In terms of the identification issues, we ideally want the same situation for the negative-binomial as for the Poisson: maintaining some sort of ordering between the two states and making sure their probability distributions aren't equal. I attempted this by summing the mus and keeping the alphas separate. In the trace plot, I'm seeing the larger alpha shrink to almost zero, while the smaller one retains some non-zero value.

Lastly, the predictive output doesn’t look similar to what we observed in the earlier examples.

Going to run the Negative-Binomial simulated data next.

brandonwillard commented 4 years ago

This is another interesting result: it looks like the first negative-binomial component "converged" to a simple Poisson with a rate parameter somewhere between the rates of the two data-generating Poisson rates, and the second negative-binomial took a slightly larger mean (a little closer to the larger second Poisson component rate) with a large(-ish) dispersion parameter!

Due to the third true state (i.e. the second, larger Poisson component) being so frequent in the simulated data, the second estimated negative-binomial component (i.e. the estimated third state) appears to be the most "active" and, because of its extra dispersion, it also subsumes some of the estimated second-state observations (which are distinctly third-state observations in the true model). It would be interesting to see the estimated state sequence vs. the true state sequence; then we could see exactly how often this over-dispersed third state attempts to account for the true second state.

In other words, it appears as though a single highly dispersed negative-binomial component will indeed account for two Poisson components—even when the two Poissons are "well separated" in their rates (e.g. in this case, with a rate parameter difference of 2000).

brandonwillard commented 4 years ago

Also, when series are long enough, it's difficult to see the results of rarely visited states (e.g. the second state in the posterior output). In these cases, we should split up the graph into stacked subplots so that we can more easily see each observation.

Here's a small snippet demonstrating how to split up a Pandas Series/DataFrame, y_test, of time-series observations by week:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.transforms as mtrans

# Split the observations into one group per week (y_test is assumed to have a
# DatetimeIndex); for a plain NumPy array, use the commented np.array_split line.
# y_test_splits = np.array_split(y_test, 4)
y_test_splits = [y_split for n, y_split in y_test.groupby(pd.Grouper(freq='W'))]

n_partitions = len(y_test_splits)

fig, axes = plt.subplots(nrows=n_partitions,
                         sharey=False,
                         sharex=False,
                         figsize=(15, 6.0))

# Shift the major (weekday) tick labels down so they don't collide with the
# minor (hour) labels.
major_offset = mtrans.ScaledTranslation(0, -10/72., fig.dpi_scale_trans)

for i, ax in enumerate(axes):
    ax.plot(y_test_splits[i],
            label=r'$y_t$', color='black',
            drawstyle='steps-pre', linewidth=0.5)

    # Minor ticks every two hours; major ticks on each weekday.
    ax.xaxis.set_minor_locator(mdates.HourLocator(byhour=range(0, 23, 2)))
    ax.xaxis.set_minor_formatter(mdates.DateFormatter('%H'))
    ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=range(0, 7, 1)))
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d %a'))

    for label in ax.xaxis.get_majorticklabels():
        label.set_transform(label.get_transform() + major_offset)

    ax.legend()

plt.tight_layout()

If you're not using a Pandas Series/DataFrame, you can use the commented np.array_split to arbitrarily split a NumPy time-series.

fanshi118 commented 4 years ago

A new notebook fitting the negative-binomial model to data generated from a negative-binomial distribution is here.

I copied the model parameters from this earlier example. This time, the simulated data was drawn from a 3-component negative-binomial HMM (one zero component, and two negative-binomials with (mu, alpha) of (5000, 3) and (7000, 8)). The series looks more continuous than the Poisson example, and, as we plot the posterior distributions, we observe similar "near-zero" behavior in alpha_*. The p_* parameters, on the other hand, exhibit slightly different behavior: the smallest component moves up, further from zero.

The posterior predictive output looks somewhat different from the Poisson case as well, and part of that may be due to how the y values have changed in distribution. Also, I split the graph into 4 subplots so we can have a closer look at the result.

brandonwillard commented 4 years ago

Is this line in the posterior predictive plotting code supposed to use index 0: ax.fill_between(post_pred_imps_hpd_df.index[pred_idx_splits[0]],?

fanshi118 commented 4 years ago

Is this line in the posterior predictive plotting code supposed to use index 0: ax.fill_between(post_pred_imps_hpd_df.index[pred_idx_splits[0]],?

Oh, yeah, that one is just an array of indices [0, 1, 2, …, 499]; otherwise, the predicted values won't align with the observed/actual values in the plot. Edited in the notebook.

brandonwillard commented 4 years ago

These results seem to indicate that both negative-binomials have—once again—"collapsed" into Poissons! That's not what we want in this case.

I think the reason it's doing this might have to do with the additive alpha_*_rv parameterization—and possibly the extreme priors on these alpha_*_rv. Given how long it took to generate posterior samples (~20 minutes?) and the acceptance probability discrepancy, NUTS might've been dealing with a very difficult log-likelihood, which shouldn't necessarily be the case for this kind of model. If the sampler had trouble moving around, it might not have arrived at a more reasonable region that included the "true" model. It clearly "knew" that the dispersion parameters were nowhere near as large as their priors indicated, but it took that to the extreme and didn't seem to do much else.

The additive alpha_*_rv parameterization seems overly restrictive, too, now that I think about it. It implies that a negative-binomial component with a larger mean/mu_* should necessarily have a larger dispersion, alpha_*_rv, than its preceding, ordered mixture components. This isn't a reasonable assumption, since there's no reason that a component with a larger mean shouldn't have a smaller dispersion than components with smaller means. Otherwise, as the dispersion parameter increases, the negative-binomial mixture components are more and more likely to overlap, leading to some of the identification issues we're trying to avoid with this parameterization in the first place!

That doesn't necessarily explain the results, but it's an important general consideration. At the very least, it could explain why the sampler had such an apparently hard time.

My advice: try removing the alpha_1_rv + alpha_2_rv parameterization and lowering the alpha_*_rv priors to something more reasonable. Perhaps something closer to the range of (0, 1] with a long-ish "tail", like pm.Lognormal('alpha_*', 0, 1). This seems like a prior we might actually want to use on real data.
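
Something along these lines (a sketch only; the names follow the notebook's conventions but are otherwise assumptions):

import pymc3 as pm

with pm.Model():
    # Separate (non-additive) dispersion parameters with modest log-normal priors.
    alpha_1_rv = pm.Lognormal("alpha_1", mu=0.0, sigma=1.0)
    alpha_2_rv = pm.Lognormal("alpha_2", mu=0.0, sigma=1.0)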

fanshi118 commented 4 years ago

Good points. Here's the updated example with separate alpha_*_rv parameters whose values are kept small. I do see that the estimated posterior alpha_*'s are centered away from zero this time, which suggests the negative-binomial model doesn't collapse into a Poisson in this case.

In the posterior predictive plot, the predictions seem more continuous/less discrete than in the previous example, which is closer in pattern to the data generated from the negative-binomial distribution.

brandonwillard commented 4 years ago

Yes, those are better results (relatively speaking). The "true" negative-binomial parameters weren't recovered, but there's still a lot that could explain that.

First, can you add a black/dark blue line with the posterior predictive mean/50% HDI? That would help clarify how well the model is performing in terms of expected value. I believe that will tell us something very important about these results.

Also, it would help to show the estimated state sequences. Comparison with the "true" sequence might be difficult—given the estimated negative-binomial components—but it could still help us identify other relevant pathologies.

(FYI: I just noticed that we've been using the label "...HDI interval", which is a little redundant: i.e. "high-density interval interval")

Some Points about Priors

You went a bit low and strong on the prior parameters for alpha_*_rv, which implies a rather informative prior. With informative priors, the posterior results are more likely to be drawn toward relatively "acceptable" parameter values that correspond with said priors. Notice how the alpha_* posterior samples are distributed about a mean value just a little over 1.0; that's not far from the prior means, which are both a little above 1.6.

It helps to think of the most extreme cases, and the Dirac delta prior is a good one. If you collapsed the prior onto one value (i.e. expressed some form of prior "certainty" about the parameter), that value is all you'd get back in the posterior—unless, perhaps, the likelihood also collapsed onto a different value. The opposite situation, in which—say—nearly equal probability is given to all values, results in "uninformative" priors. There's a lot more to this situation, but these are the basic ideas.

You can also consider a prior distribution as a type of constraint in an optimization problem. As with constraints, priors can guide results toward preferred values and away from other ones.

Overall, we (or at least I) have a few a priori assumptions that aren't expressed via these priors, and those could be contributing to these results, but, given that we're simulating the data, it's hard to separate those expectations from our direct knowledge of the "true" model. We can still try, though.

To start, we might want to assume that the alpha_*_rv are just as likely to be around 0 as they are to be around 1—and perhaps some larger values. Now, I don't know how well one can express such a relationship using a log-normal, but I think we can do much better than we currently are. For example, a gamma with shape less than 1 and a scale value larger than 1 puts a fair amount of density near zero and values up to 1. The gamma will cut off pretty sharply after that, and, if that's not desirable, we can look into the generalized gammas and other more flexible distributions.
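
To see what that looks like, here's a quick sketch comparing a gamma with shape < 1 and scale > 1 against a standard log-normal (the specific hyperparameter values are just for illustration):

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

x = np.linspace(1e-3, 5, 500)
plt.plot(x, st.gamma.pdf(x, a=0.5, scale=2.0), label="Gamma(shape=0.5, scale=2)")
plt.plot(x, st.lognorm.pdf(x, s=1.0, scale=1.0), label="Lognormal(0, 1)")
plt.xlabel("alpha")
plt.legend()
plt.show()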

Taking a somewhat empirical Bayesian approach, we can at least say that—given the observed data—the gamma mu_*_rv priors could easily have larger means that are a greater distance apart. Additional observations-based assessments could be used to set initial state and transition probability priors that coincide with our non-empirical assumption that there are two distinct states.

fanshi118 commented 4 years ago

Updated the plots with 50% HDI added (same link); fixed those labels too. Looking into prior selection in the meantime.

brandonwillard commented 4 years ago

The updated notebook won't load for me now. Is there a lot of new output in the notebook?

fanshi118 commented 4 years ago

Is there a lot of new output in the notebook?

Don't think so, but looks like that notebook has some rendering issues… Anyway, here's a new one.

I set up the priors for alpha_*_rv in such a way that shape<1 and scale>1. I also made the two priors a bit different: alpha_1_rv meets the threshold by a small margin, while alpha_2_rv significantly exceeds the threshold. In the trace plot, they both end up somewhere near and less than 1. The posterior predictive plot looks somewhat similar to the previous one.

Additional observations-based assessments could be used to set initial state and transition probability priors that coincide with our non-empirical assumption that there are two distinct states.

Should we do anything to the current mu_*_rv on this part?

brandonwillard commented 4 years ago

I set up the priors for alpha_*_rv in such a way that shape<1 and scale>1. I also made the two priors a bit different: alpha_1_rv meets the threshold by a small margin, while alpha_2_rv significantly exceeds the threshold. In the trace plot, they both end up somewhere near and less than 1. The posterior predictive plot looks somewhat similar to the previous one.

I don't think we need these two priors to be different in any way. They're not exactly bound by the same identifiability/overlap-related concerns as the mu_*_rv are, except—perhaps—that we don't want them to be too large; otherwise, both negative-binomials will overlap too much and that can cause identifiability issues. In other words, if the mu_*_rv are distinct and the alpha_*_rv are both small and equal, there's no problem. For instance, in the limit, when both are near zero, we would have two well-identified Poissons!

Just so you know, there's also some subjectivity to this concern we've had over "identifiability". There could be situations in which negative-binomials with nearly equal means (i.e. mu_*_rvs) and distinct dispersion parameters (i.e. alpha_*_rvs) are desirable. Consider some kind of "stochastic volatility" modeling scenario, where the intent is to identify distinct volatility regimes (e.g. low and high volatility states).

Should we do anything to the current mu_*_rv on this part?

Yeah, if we want to recover the original model in situations with a lot of "overlap", as I stated at the end of my earlier comment, we need to express the types of solutions we want via our priors, and part of that could be accomplished using more sophisticated parameterizations or in an empirical way.

For instance, we know that we're estimating two negative-binomial components, and that we want their means to be distinct in some sense. We have an additive parameterization that does this, but it doesn't take into account exactly how distinct the means are, so our estimates end up producing means that are not very distinct relative to their magnitudes. A more sophisticated approach might take into account the variances, and, perhaps, allow us to specify a degree of distributional overlap that we're willing to accept.

As an example of the empirical approach, we could partition the range of observed data and set E_*_mu to the ordered means of those partitions. For that matter, we can even use point estimates from other models!
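
A crude sketch of that empirical idea (assuming y_test holds the observed counts and that the E_*_mu are the prior means in the additive mu parameterization):

import numpy as np

# Split the non-zero observations at their median and use the ordered partition
# means; whether E_2_mu should be the increment or the absolute level depends
# on the exact parameterization in the model.
y_nonzero = np.sort(np.asarray(y_test)[np.asarray(y_test) > 0])
low_part, high_part = np.array_split(y_nonzero, 2)
E_1_mu = low_part.mean()
E_2_mu = high_part.mean() - low_part.mean()  # increment under the additive scheme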

There are plenty more things we can do in both of these categories, and we can even use both approaches simultaneously.

Note: before you get too inventive, it always helps to quickly research what others have already done. Often, Google Scholar is a great place to start.

For testing purposes, we don't need to be too sophisticated, so just try "informing" those priors manually (e.g. set the E_*_mu to values somewhat near the true ones) and see if it yields what you expect.

fanshi118 commented 4 years ago

Sounds good. Just incorporated some of those points in this new notebook.

The alpha_*_rv are kept relatively close. I also increased the E_*_mu to values somewhat near the simulated data's, particularly E_2_mu, so that the additive parameterization draws a greater distinction between the two component distributions.

In the posterior predictive plot, I'm seeing more variance than before in the 50% line.

Going to look into some more sophisticated approaches, such as taking variances into account and using point estimates from other models.

brandonwillard commented 4 years ago

The mean behaviour is much better, but the results are still too concentrated on the third mixture component; however, after seeing the results of these past few changes, I get the impression that very strong priors will be necessary in order to accurately recover the "true" parameters for this simulation.

The over-dispersion parameters in the simulation are too large (relative to the means) and the resulting simulation ends up being too "vague" for our priors (and model, really). For instance, you might need to set even more of the priors closer to the true parameter values in order to get the desired posteriors. Otherwise, the level of sophistication required for a parameterization to account for this vagueness could be too great for what it's worth.

This has been a very illustrative exercise, but, for the sake of moving forward, let's just make the simulation more amenable to these priors and focus on confirming that the samplers work at a basic level. As a matter of fact, I would consider even more simplifications so that the test runtime isn't unnecessarily long (without sacrificing the test coverage itself, of course).

fanshi118 commented 4 years ago

In order to shorten the test runtime, should we change the priors around? I've tried using the same priors with different parametrization in the simulated data (i.e. shrinking means and/or dispersion parameters), and so far there hasn't been a noticeable drop in the amount of time taken to generate posterior samples.

On a separate/related note, in terms of simplification, should we focus more on the simulated data or the model?

brandonwillard commented 4 years ago

I was saying that we should simplify the simulation (e.g. reduce the dispersion and/or overlap between mixture components) so that we can keep using simple priors. Again, we just want to make sure that we can estimate multiple negative-binomial mixture components in an HMM, and that starts with the case in which those components are easily identifiable.

Also, keep in mind that acceptable unit tests are as small, quick, and simple as possible, so things like posterior predictive sampling aren't an option. We need to base our checks on the posterior samples alone, and small, extreme cases are easier to check.

For example, you could simulate an HMM with two negative-binomials and effectively zero overlap between them, then you can more confidently apply simple checks against the true and estimated parameter values.
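
Such a check might look something like the following (the trace variable names and tolerances are hypothetical):

import numpy as np

def check_nb_posteriors(trace, true_mu, true_alpha, rtol_mu=0.1, rtol_alpha=0.5):
    # Test-style assertions on the posterior samples alone (no posterior
    # predictive sampling); "mu_1"/"alpha_1" are assumed variable names.
    assert np.abs(trace["mu_1"].mean() - true_mu) / true_mu < rtol_mu
    assert np.abs(trace["alpha_1"].mean() - true_alpha) / true_alpha < rtol_alpha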

Note: you can easily "guesstimate" values that achieve such overlap, but, if you want to know more about the general idea, look up the Hellinger distance, Bhattacharyya coefficient, etc.
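
For example, here's a sketch of the Bhattacharyya coefficient between two negative-binomial components (values near zero mean essentially no overlap); the mapping to scipy's nbinom parameterization is my assumption:

import numpy as np
import scipy.stats as st

def nb_overlap(mu_1, alpha_1, mu_2, alpha_2, x_max=20000):
    # Bhattacharyya coefficient: sum_x sqrt(p(x) * q(x)); 1 means identical
    # distributions, ~0 means effectively disjoint.
    x = np.arange(x_max)
    p = st.nbinom.pmf(x, alpha_1, alpha_1 / (alpha_1 + mu_1))
    q = st.nbinom.pmf(x, alpha_2, alpha_2 / (alpha_2 + mu_2))
    return np.sqrt(p * q).sum()

print(nb_overlap(5000, 1.0, 7000, 1.0))     # large dispersion -> substantial overlap
print(nb_overlap(500, 100.0, 7000, 100.0))  # well-separated means -> nearly zero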

fanshi118 commented 4 years ago

Makes sense. I reduced the dispersion parameters and increased the relative amount of separation between the mus in this simulation, and adjusted the mu_*_rv in the model to be congruent with the updated data. As a result, judging by the trace plot, the amount of overlap between mixture components has been reduced. Is this somewhat in line with what we expect, from a simplification standpoint?

Roger that on the small, quick, and simple part re: unit tests.

brandonwillard commented 4 years ago

That simulation does look better, although the true over-dispersion/"index" parameters (i.e. alpha_*) were pretty close to each other and their estimated values were a bit off.

Try setting one of the alpha_* to something near zero (e.g. ~0.05) and the other to something greater but well below 1 (e.g. ~0.2)—i.e. small values that are clearly separated. If we can get posterior samples for alpha_* in those ranges, then that would be a good sign.

Also, it might be good to include Var[Y | S] alongside your E[Y | S] sample estimates; that will help illustrate the amount of overlap induced by even seemingly small values of alpha_*.

fanshi118 commented 4 years ago

Okay, I see what you're referring to re: alpha_* being close and estimated values a bit off.

Here's a new notebook. I replaced the dispersion parameters with 0.05 and 0.2; however, the result didn't seem to go as well as we thought. Judging by the trace plot, the posterior samples for alpha_* are somewhat overestimated (although there's a distinguishable gap between the two), while the mean behavior doesn't seem as good as last time's.

In the sample estimates, I added two lines in green to delineate Var[Y | S] alongside E[Y | S].

fanshi118 commented 4 years ago

An updated example here.

I kept tuning the dispersion parameters so that the alpha_* in our posterior samples could fall within (or get close to) their ranges. With values of 1 and 5, I was able to get something that seemed reasonable. However, the mu_* in our posterior samples are a bit less accurate than last time's.

In the posterior predictive plot, I used the standard deviation of Y | S and kept the confidence interval at 80%. Let me know if another threshold works better.

brandonwillard commented 4 years ago

The alpha_*_rv priors still have distribution parameter values that aren't very "accepting" of the value regions in which the "true" over-dispersion parameters lie (e.g. both priors have means greater than one). I believe that—in combination with the natural "shapes" of the chosen alpha_*_rv priors under those parameter values—mostly explains the overly large posterior results we're seeing.

I recommend adding plots for the prior distributions; those will clarify a lot. In general, if you simulate model parameters that lie in comparatively low probability regions of your priors, then it's not unlikely that the estimation method (MCMC in this case) will focus its efforts almost exclusively in other, relatively acceptable regions. Without specialized parameterizations/constraints or estimation methods to compensate for these low-probability regions of the prior, there's not much more that can be done to remedy the situation—beyond the use of more and/or longer sample chains (in the hope that some samples will be drawn from the "true" low-prior-probability regions, of course).

Otherwise, I don't understand those new intervals that were added to the plots. They look like classical CLT-like (i.e. Gaussian-based) confidence intervals, but such approximations aren't necessary here, since we have entire sets of posterior (predictive) samples for each observation, so we can always compute more accurate sample statistics instead.
When I mentioned Var[Y | S], I was simply referring to sample variances for the simulated observations. In other words, it might help to look at values like the following in order to understand how much the alpha_* values can affect the dispersion of the simulated output:

print("Var[Y_t | S_t = 1] = {}".format(np.var(y_test[s_test == 1])))
print("Var[Y_t | S_t = 2] = {}".format(np.var(y_test[s_test == 2])))

brandonwillard commented 4 years ago

Here's some example code that plots an approximation of PyMC3's negative-binomial using the gamma mixing density from pm.NegativeBinomial.random:

import numpy as np
import scipy.stats as st

import matplotlib.pyplot as plt

def NegBinom(x, mu, alpha):
    # Approximate the NegativeBinomial(mu, alpha) mass function by the gamma
    # mixing density used in pm.NegativeBinomial.random
    # (i.e. shape=alpha, scale=mu/alpha).
    return st.gamma.pdf(x, a=alpha, scale=mu/alpha)

x = np.arange(0, 2000)
# N.B.: these alpha values are the reciprocals of the dispersion values we've
# been considering.
alphas = 1. / np.r_[0.05, 0.2]
mus = np.r_[10, 1000]

fig, axes = plt.subplots(2, sharex=True)
for a, m, ax in zip(alphas, mus, axes):
    pmf = NegBinom(x, m, a)
    ax.plot(x, pmf, '-', label=r'$\alpha$ = {}, $\mu$ = {}'.format(a, m))
    ax.set_ylabel('nbinom.pmf(x)', fontsize=10)
    ax.legend()

plt.xlabel('x', fontsize=12)
plt.tight_layout()
plt.show()

Notice that the alpha parameters are reciprocals of the values we've been considering. That said, it looks like we've been thinking about the parameterization incorrectly!

brandonwillard commented 4 years ago

Note: PyMC3 uses half-Cauchys as default priors for mu and alpha in GLM models.
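
For reference, a one-line sketch of what that kind of prior would look like here (the name and scale are placeholders):

import pymc3 as pm

with pm.Model():
    # Weakly informative half-Cauchy prior on a dispersion parameter.
    alpha_1_rv = pm.HalfCauchy("alpha_1", beta=1.0)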

fanshi118 commented 4 years ago

Thanks for the notes! Here's the updated example with the reciprocals of the values we were considering. I've added those prior plots, using the above code snippet, after the cell where we print out E[Y | S] and Var[Y | S].

In the posterior samples, I'm seeing mu_1 and alpha_1 fall within the ranges we were looking for; mu_2 and alpha_2, on the other hand, didn't get quite as close, despite moving towards 1000 and 5, respectively. The mixture component plot also seemed a bit off, just judging by the shape of p_2.

In the posterior predictive plot, there's an increased degree of sparsity overall. Could that possibly be due to the reduction in dispersion from how we initialized the priors?

brandonwillard commented 4 years ago

Yeah, it does look a little better; however, I only see plots for the component/emissions distributions—which are in the observation/likelihood-defining part of the model—and not the priors, so I can't tell whether or not those posterior discrepancies are due to the influence of unnecessarily informative priors. It still seems quite likely that they are.

You can use pm.sample_prior_predictive to generate samples from all or a few of the test_model priors, then you can plot the results using pm.traceplot without having to rewrite code. I recommend sampling and plotting only the mu_*_rv and alpha_*_rv terms for now.
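
For instance, something like the following (assuming the model object is named test_model and that the PyMC3 variable names match the mu_*_rv/alpha_*_rv convention used above):

import pymc3 as pm

with test_model:
    prior_samples = pm.sample_prior_predictive(
        samples=1000,
        var_names=["mu_1", "mu_2", "alpha_1", "alpha_2"])  # assumed variable names

# sample_prior_predictive returns a dict of arrays, which pm.traceplot (via
# ArviZ) can typically plot directly.
pm.traceplot(prior_samples)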

Some minor notes:

fanshi118 commented 4 years ago

Thanks for the notes, again. I've addressed these points in here.

This time, we're using the regular definition of the dispersion parameters (i.e. the alpha values) and inverting them as we plug them into the negative-binomial distribution functions. In our model, we changed the mu_*_rv so that they overlap less with the simulated data than they did last time; plus, we initialized the alpha_*_rv with pm.HalfCauchy priors (as opposed to gamma).

The results seem to make sense in the posterior samples (trace plot), except that mu_2 and alpha_2 fall slightly short of their expected values, though they're definitely moving in the right direction and outperforming the previous results. The three states seem to be relatively well separated. In addition, we've switched to a log scale on the y-axis for the posterior predictive samples, which helps us better visualize how the model picks up on the different states.

brandonwillard commented 4 years ago

All right, I think we've answered most of our negative-binomial related questions, since we've shown that they can be estimated sufficiently well, so I'm going to close this issue for now.