aimalz / prob-z

working with probabilistic redshifts

find the maximum marginalized likelihood p(z) #33

Open davidwhogg opened 8 years ago

davidwhogg commented 8 years ago

You are sampling in p(z) or N(z) space. Let's also try optimizing in p(z) space. Here you drop the prior over p(z) and just optimize it, but marginalizing out the redshifts of all the individual galaxies. This should be easier than sampling. If you have trouble with the math, let me know. As for optimizers, you can just use scipy.minimize.
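
A minimal sketch of what such a marginalized-likelihood optimization could look like with scipy, assuming binned N(z) parameters, per-galaxy interim posteriors tabulated on a shared redshift grid (`pofz`), an interim prior (`interim`), and a bin width (`dz`); these names and the log-space parametrization are illustrative, not the repo's API:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marg_like(logf, pofz, interim, dz):
    """Negative log marginalized likelihood of the binned N(z) parameters.

    logf parametrizes the bin heights in log space so they stay positive; each
    galaxy's redshift is marginalized out by summing over the grid, reweighting
    its interim posterior by f / interim.
    """
    f = np.exp(logf)
    f /= np.sum(f * dz)                                # normalize to a density
    per_galaxy = (pofz * (f / interim)).sum(axis=1) * dz
    return -np.sum(np.log(per_galaxy))

# illustrative usage, assuming pofz, interim, and dz exist as described above:
# res = minimize(neg_log_marg_like, np.zeros(interim.size),
#                args=(pofz, interim, dz), method='Nelder-Mead')
# f_ml = np.exp(res.x) / np.sum(np.exp(res.x) * dz)
```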

aimalz commented 8 years ago

I am calculating the maximum likelihood parameter values before sampling, but I'm stuck on a little problem. scipy.optimize fails to yield sensible maximum likelihood N(z) parameters, in that some elements of the parameter vector go to totally unreasonable values. I've tried all the unconstrained methods (fmin, fmin_bfgs, etc.) and they all suffer from the same problem. I'm still working on it.

davidwhogg commented 8 years ago

But is the likelihood going up or not? I.e., is it an optimizer error? If so, file bug reports against scipy. I bet it isn't -- I bet it is an issue with the likelihood function, or else a think-o about what the right answer should be.

Also, can you show me an example, instead of describing in words, and show me why you think the answer is wrong?

ixkael commented 8 years ago

One simple check you could try before feeding the likelihood to scipy (which fails to converge) is to manually compute the likelihood on a coarse grid. Since you know the ground truth you should be able to construct a coarse grid and verify that the likelihood looks OK.
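
For example, a hedged sketch of that sanity check, assuming `log_likelihood` is the same callable handed to the optimizer and `params_true` is the ground-truth parameter vector from the simulation (both hypothetical names):

```python
import numpy as np

def grid_check(log_likelihood, params_true, index, width=0.5, npts=11):
    """Vary one parameter on a coarse grid around its true value, holding the
    others fixed at truth, and return the grid and the log-likelihood along it;
    the curve should peak at (or very near) the true value."""
    grid = params_true[index] + np.linspace(-width, width, npts)
    vals = []
    for g in grid:
        trial = np.array(params_true, dtype=float)
        trial[index] = g
        vals.append(log_likelihood(trial))
    return grid, np.array(vals)
```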

davidwhogg commented 8 years ago

excellent idea

aimalz commented 8 years ago

Agreed! When I do it this way there are far fewer crazy values (zeros where nonzeros should be expected), but I'm also making the grid quite coarse because it's incredibly slow. I've plotted an example of the maximum likelihood N(z) here: [figure: truenz] A question I'm thinking about: do we expect the maximum likelihood to do better or worse than stacking/MAP? It comes out comparable in these small test cases.

davidwhogg commented 8 years ago

Can you also give the marginalized likelihood values (the scalar objective) for each of the curves you plot? So we get the two KLDs and also the marginalized likelihood?

And can you just give things to, say, 2 digits of accuracy?
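
A minimal sketch of how one such KLD could be computed, assuming the true N(z) and an estimate are both tabulated on the same redshift grid with bin width `dz` (illustrative names, not necessarily how the repo computes it):

```python
import numpy as np

def kld(nz_true, nz_est, dz):
    """KL divergence D(true || estimate) for binned densities with bin width dz;
    assumes the estimate has support wherever the truth does."""
    p = np.asarray(nz_true, dtype=float)
    q = np.asarray(nz_est, dtype=float)
    p /= np.sum(p * dz)                # normalize both to integrate to one
    q /= np.sum(q * dz)
    mask = p > 0                       # 0 * log 0 contributes nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]) * dz)
```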

aimalz commented 8 years ago

The figure above has been updated to show the marginalized likelihood values and the KLDs, with fewer digits (duh).

davidwhogg commented 8 years ago

I do not understand how the interim can only be exp(-4) worse than the true answer. That makes no sense at all to me. How many data points do you have? Galaxies, that is?

aimalz commented 8 years ago

This was with a Poisson draw around 100 galaxies (see plot title), but the issue persists with larger survey sizes because the likelihood is dominated by the Poisson term (normalization to survey size). I know there are big issues with using a small survey size, but I like to make sure the code works on small survey sizes before running a more time-consuming test.
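
To illustrate the claim about the Poisson term, here is a hedged sketch (not the repo's exact expression) of where a normalization term in the survey size would enter if the parameters are expected counts per redshift bin:

```python
import numpy as np

def log_poisson_term(expected_counts, n_gal):
    """log Poisson(n_gal | total expected count), dropping the constant log(n_gal!).

    This piece scales with the survey size, so it can swamp differences in the
    per-galaxy shape terms of the log-likelihood.
    """
    total = np.sum(expected_counts)
    return n_gal * np.log(total) - total
```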

davidwhogg commented 8 years ago

What's time consuming about comparing the truth to the interim prior for a large number of galaxies? Shouldn't be expensive at all. If it is: Bug.

aimalz commented 8 years ago

Whoops, I'm getting confused among the three active issues. Yes, this one comes from generating the data and doesn't require the slow sampler.

davidwhogg commented 8 years ago

So, can you check the asymptotics of the marginalized-likelihood difference between the interim prior and the truth? And tell us how long it takes to compute the MLE as a function of the number of galaxies, and also whether the optimizer is "failing" because it hits maxfev or maxiter? Because those are not failures of the optimizer. Oh, and in other issue threads it looked like the likelihood at your maximum-likelihood estimate was worse than for some of the other estimates you have, which is impossible / crazy -- initialize at your best current guess and optimize from there.

aimalz commented 8 years ago

Just to clarify, what do you mean by asymptotics here? I've checked that the likelihood improves the closer we go towards the true value from the interim prior.

The optimizer failed because it hit maxiter (even for high values of maxiter) before converging. I tried constrained optimization and had much better results, in that I consistently got an answer in a reasonable time. It's still having a problem with finding local extrema instead of a global maximum likelihood, but it consistently improves from wherever I start it.

The bizarre-looking max likelihood value actually comes out the same for one of the optimizers I tried and for the grid search, so I'm inclined to believe it's actually a legit local extremum. But with a different optimizer I get more sensible values, even if they're not likely to be the global maximum either.

davidwhogg commented 8 years ago

Can you make a plot demonstrating your first assertion? You had implied the opposite earlier. As I repeat consistently, we always want to get answers in visual form, not word form.

aimalz commented 8 years ago

Sorry for not responding sooner, I spent all day chasing down a "bug" that made it look like the samples were systematically biased but was actually a product of setting the random number generator's seed.

Here's a plot showing the likelihood as a function of the "fractional truth" of test parameter values, where each test value is a weighted average of the true and interim values and the fractional truth is the weight placed on the true value. The likelihood generally increases for values closer to the truth, except very near the true value, where many nearby values form local minima in likelihood space and the trend breaks down. [figure: liktest]
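
A minimal sketch of that scan, assuming `theta_true` and `theta_interim` are parameter vectors on the same grid and `log_likelihood` is the marginalized likelihood under test (hypothetical names); the weight grid can also be extended past 1 as long as no mixture goes negative:

```python
import numpy as np

def fractional_truth_scan(log_likelihood, theta_true, theta_interim,
                          qs=np.linspace(0.0, 1.0, 21)):
    """Evaluate the likelihood along the line from the interim values (q = 0)
    to the truth (q = 1); pass a grid with q > 1 to look past the truth."""
    vals = []
    for q in qs:
        theta = q * np.asarray(theta_true) + (1.0 - q) * np.asarray(theta_interim)
        vals.append(log_likelihood(theta) if np.all(theta > 0) else np.nan)
    return np.asarray(qs), np.array(vals)
```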

davidwhogg commented 8 years ago

Okay good. Here's how I think we should proceed to optimize the marginalized likelihood:

when optimizing, set maxiter = np.Inf and maxfev = np.Inf

Are you giving the optimizer the derivatives of the likelihood? If the optimization is very slow we will do that next.

Can you confirm that this plan works? I mean by executing it and telling me how long it takes, along with how many galaxies you used and so on?
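
As a hedged illustration of passing those limits (with a stand-in quadratic objective, since the real likelihood lives in the repo): `maxiter` and `maxfev` are the Nelder-Mead option names in `scipy.optimize.minimize`, and very large values have the same effect as infinity, i.e. the run only stops at convergence.

```python
import numpy as np
from scipy.optimize import minimize

objective = lambda x: np.sum((x - 1.0) ** 2)   # stand-in for the marginalized likelihood
res = minimize(objective, np.zeros(5), method='Nelder-Mead',
               options={'maxiter': int(1e7), 'maxfev': int(1e7)})
print(res.success, res.x)                      # converges well before the limits
```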

davidwhogg commented 8 years ago

Oh and one more thing, can you make the plot you posted today (above) for a sample with 10x as many galaxies? I think we can predict what that should look like, given this plot.

aimalz commented 8 years ago

I set the prior to have a mean equal to the MLE optimized starting from the best of the interim prior, MAP N(z), and stacked N(z), so it was a little awkward to redo the optimization including samples from the prior. It also doesn't make any difference compared to just starting from the best of the original candidates without the samples from the prior. Once I chose a good constrained minimizer (scipy.optimize.fmin_slsqp), optimization was fast (~3 s for 10000 galaxies) even without supplying the derivatives of the likelihood.
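
A hedged sketch of what that constrained call can look like, reusing the hypothetical `pofz`, `interim`, and `dz` arrays from the earlier sketch; here the bin heights are optimized directly, kept nonnegative via bounds, and forced to integrate to one via an equality constraint:

```python
import numpy as np
from scipy.optimize import fmin_slsqp

def neg_log_like_direct(f, pofz, interim, dz):
    """Same marginalized likelihood as above, but over the bin heights f directly."""
    per_galaxy = (pofz * (f / interim)).sum(axis=1) * dz
    return -np.sum(np.log(per_galaxy))

# illustrative usage, assuming the arrays exist as described above:
# f0 = np.full(interim.size, 1.0 / (interim.size * dz))   # flat starting guess
# f_ml = fmin_slsqp(neg_log_like_direct, f0, args=(pofz, interim, dz),
#                   f_eqcons=lambda f, *a: np.array([np.sum(f) * dz - 1.0]),
#                   bounds=[(1e-12, np.inf)] * interim.size, iprint=0)
```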

Here's the same plot for 10000 galaxies (sorry the y axis label is cut off): [figure: liktest]

That's pretty much exactly what I expect it to look like: the same shape as for fewer galaxies, but with the likelihoods scaled by the growing Poisson penalty term.

davidwhogg commented 8 years ago

Shouldn't this peak at 1.0? Isn't 1.0 the truth?

davidwhogg commented 8 years ago

Also, fix the x axis so I can read the label, and label the y axis.

Also, how can your prior depend on the MLE? The prior is supposed to be set prior to seeing any data. Let's obey those rules, since the whole point of this paper is to be principled.

aimalz commented 8 years ago

The truth is consistently not the MLE, but something close to the truth is. Given how noisy the data are, I wouldn't expect the truth to actually be the best fit to the data, just something close to it.

The prior was just set to a multivariate Gaussian with some mean and covariance. I calculate the stacked, MAP, and MLE values and can set the mean of the prior to be anything, be it a flat N(z), the interim prior, or any of the other estimators' results. On principle I'll set it to something that's not calculated from the data.

davidwhogg commented 8 years ago

There are proofs -- you should know them; look them up. Asymptotically, the maximum marginalized likelihood should be an unbiased, efficient estimator. It doesn't look right here. Increase the number of galaxies by 10 again and plot it again. I smell bug.

What are the two models you are "mixing" in these plots? And can you plot the true p(z), the thing you are mixing it with, and the mixture that gets the best MLE, all on one plot?

ixkael commented 8 years ago

I agree with Hogg. The fact that the MLE is systematically offset from the truth is suspicious. In a single realisation this can happen, but if you 1) increase the amount of data, or 2) average over multiple simulations, the MLE should converge very rapidly to the true values of the parameters.

In fact this is the first thing I do when I write a piece of code. I draw points from the likelihood and check that I can recover the input parameters at very high precision, by increasing the amount of data linearly, and also averaging over instances of the simulation. This process should also give error bars that are very close to the theoretical expectation (the curvature/Fisher matrix).

Of course this doesn't necessarily mean that your likelihood is wrong or that there is a major code bug; it could always be due to a subtlety somewhere else, and there must be a rational explanation. But in the absence of such a justification, having a likelihood optimisation that does not work on simulated data generated by the same likelihood is very suspicious.

But maybe you are doing something slightly different, in which case what I wrote above does not have to be true!

aimalz commented 8 years ago

OK, I figured out what was wrong. I was suspicious primarily because it got worse with more galaxies when it should have gotten better. The proofs that the MLE is unbiased rest on the assumption that the true redshift is contained within the zPDF, but to cut corners I'd restricted the dimensionality such that this was not always true, and with more galaxies it was violated more often. The mix of the interim prior and the truth that you asked for is what's in the plot. The figure has been updated.

davidwhogg commented 8 years ago

I think that all the proofs require is that the model be correct: Ie, that the LFs you are using represent the statistical process by which the data are generated. Is this true or not true (of your code)? And if previously it was not true, how so and why? Also, can you use some more careful terminology? I don't understand the words "true redshift is contained within the zPDF". I doubt this is a requirement, but I don't know what the zPDF is.

Can you extend your figures through x=1, so we see the full peak? I think that is a straightforward generalization of what you are doing?

I don't understand what the change to your code is; I am going to open a new issue on this.

davidwhogg commented 8 years ago

One more thing -- in the future, don't over-write wrong / old figures. Now this issue thread is unreadable and wrong. For the future, post new figures.

aimalz commented 8 years ago

The likelihood function I used is based on the individual posteriors and the interim prior, not the process by which individual posteriors were generated. This is given in the equation for the full posterior once the prior term is removed. Is this not the likelihood you're asking about?

I'm using "zPDF" to refer to the individual redshift posteriors. I introduced the shorthand in my paper but can use the full phrase if you prefer.

Extending the figure like that doesn't make sense, because 1 represents the true N(z); all the points correspond to mixtures of the true N(z) and the interim prior.

davidwhogg commented 8 years ago

I don't understand what you are saying: Does your likelihood function at x=1 in the plots above correspond to the method by which the data were generated? If not, you are not doing the experiment we want.

How could it ever be that the true redshift has no probability in the zPDF?

Okay, if what I suggest makes no sense (which is false) then figure out something that shows a PEAK at the TRUTH. That's all I want!! What I say makes sense because you are testing

theta = q * theta_true + (1 - q) * theta_wrong

there is no requirement that q <= 1. Try, for example, q=1.1!

You only hit a problem if things go negative, but that won't happen immediately at q=1.001.

aimalz commented 8 years ago

The likelihood at x=1 is the likelihood of the true N(z). The likelihood is calculated from the zPDFs and the interim prior, not the process by which zPDFs are generated. Does that make sense? I worked out the likelihood based on the data generation process (which involves hyperparameters of its own) a couple weeks ago and determined it was a Dirichlet distribution, but I didn't think that was what you were asking for.

The zPDF is centered on an "observed" redshift that is shifted from the true redshift by a Gaussian random variable. The zPDF also has an "observed" variance that is another Gaussian random variable. It is possible (rarely) to have a standard deviation that is smaller than the shift from the truth (especially when the shift takes the "observed" redshift outside the bins over which the zPDFs are defined), so there may be very little probability of the true redshift within the zPDF. I had initially been expanding the bins so all "observed" redshifts were included, but turned that off to restrict dimensionality for quicker testing. It turns out that functionality is essential for the math to work, so that's not a corner I can cut.
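
For concreteness, a rough sketch of a zPDF generated that way (names and constants are illustrative, not the repo's values):

```python
import numpy as np

def mock_zpdf(z_true, z_grid, shift_sigma=0.05, scatter_mean=0.05,
              scatter_sigma=0.01, rng=None):
    """Toy zPDF: an "observed" redshift shifted from the truth by Gaussian noise,
    an "observed" scatter that is itself a Gaussian draw, and a Gaussian zPDF
    evaluated and normalized on the binned grid."""
    rng = rng if rng is not None else np.random.default_rng()
    z_obs = z_true + rng.normal(0.0, shift_sigma)
    sigma_obs = abs(rng.normal(scatter_mean, scatter_sigma))
    pdf = np.exp(-0.5 * ((z_grid - z_obs) / sigma_obs) ** 2)
    dz = z_grid[1] - z_grid[0]
    return pdf / np.sum(pdf * dz)
```

If z_obs lands outside the grid, the normalization step piles the cut-off tail into the endpoint bins, which is the pathology described below.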

Oops, brain fart, I wasn't thinking about q > 1. Here's a demonstration that the likelihood peaks at the true value: [figure: liktest]

davidwhogg commented 8 years ago

I still don't understand -- an object that is, say, 5 sigma away from z_true should happen about once in 10^7 draws, so that isn't the problem. The edges are also not problems: say z_obs is outside the range of z_true: so what? The p(z) will still have all of its weight inside the z_true region, since (presumably) the interim prior has the same support as the true prior. If it doesn't, make it so! Do you see what I mean? No math is different in any way if z_obs is outside the z_true range.

Oh and good on the peaking at x=1 thing.

aimalz commented 8 years ago

The problem was resolved by allowing the interim prior to be nonzero for all bins spanning {z_obs}. When I set it to zero for bins outside {z_true} it artificially boosted signal in the endpoint bins whose distributions were cut off before normalization. The inappropriate interim prior also entered the likelihood function, which is why it was peaking in the wrong place; the combination of truth and interim prior that had more weight in the middle but also significant weight in the tails was a better fit because the data were wrong. That's the only explanation I can think of for why the problem was resolved by making that one change.

davidwhogg commented 8 years ago

That sounds like a problem. As long as you use the interim prior correctly, and as long as the interim prior and the truth have the same support, I don't think it matters what the interim prior is. There was nothing "inappropriate" about the interim prior cutting off outside the z_true range. After all, it is a prior on the true redshifts, not the observed ones.

davidwhogg commented 8 years ago

As a test: Can you change the interim prior to something that is curved -- like a parabola in z_true? Everything should work no matter what shape we have for the interim prior, so long as there is support.

davidwhogg commented 8 years ago

wait -- don't do that here; I am opening a new issue for that.

aimalz commented 8 years ago

I think you're right about why my explanation is wrong, but I don't have a better idea for why not cutting off the bins to a predetermined dimensionality fixes the problem. Even though the support is unaffected, it seemed like it must have been some kind of support issue because expanding the redshift range beyond the true redshifts was the solution.

davidwhogg commented 8 years ago

That's a bug, I think.

aimalz commented 8 years ago

Agreed -- I found a bug in which I had set unnecessarily narrow bounds for the constrained optimization calculating the maximum likelihood. The bounds were linked to the number of bins, so the resulting MLE appeared to be better for more bins. The bug has since been squashed.

davidwhogg commented 8 years ago

Can you demonstrate that visually -- by showing that you get the right answer when you don't have the extended support for the prior?

Also, given that this bug was only found in investigating this issue, how are you going to demonstrate to me (and, worse, the skeptical outside world) that you have found all the bugs of this type?

aimalz commented 8 years ago

That's what's in https://github.com/aimalz/prob-z/issues/37.

I need to be able to justify all the constants I use, such as bounds in a constrained optimization problem. This one was obviously wrong, but others are more subtle. You pointed out that my individual redshift posteriors are worse than what we'd expect from real datasets. That's because of the constants I chose to go into the variance of the "observed" redshift relative to the true redshift and the standard deviations of the Gaussian components. Currently these choices are fairly arbitrary.

aimalz commented 8 years ago

Just an update: the bug has mysteriously returned. I have plots proving that it was resolved for a while, but code backed up at that point no longer produces the right answer. I fear the culprit may have been .pyc files not updating between runs. I've been trying to figure it out since yesterday. EDIT: I found an error in plotting. All is well. The paper has been updated but still lacks relevant figures.