annayqho / TheCannon

a data-driven method for determining stellar parameters and abundances from stellar spectra
MIT License

Scatter estimates #42

Closed · tingyuansen closed this issue 9 years ago

tingyuansen commented 9 years ago

The code estimates the "scatter" term with a grid search. There might be a better way to get it. The trick is to assume an inverse-Gamma conjugate prior and take the limit where the inverse Gamma converges to a uniform distribution. In that limit, both the best estimate of the coefficients and the scatter term can be derived analytically.
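For reference, one reading of this suggestion as a minimal math sketch (assuming the mean model is held fixed and all objects share a common variance q^2): a Gaussian likelihood with a uniform prior on q^2 (the flat limit of the inverse Gamma) gives an inverse-Gamma posterior whose mode is just the mean squared residual,

$$p(q^2 \mid r_1, \dots, r_N) \propto (q^2)^{-N/2} \exp\!\left(-\frac{1}{2 q^2} \sum_{n=1}^{N} r_n^2\right), \qquad \hat{q}^2 = \frac{1}{N} \sum_{n=1}^{N} r_n^2,$$

where r_n = data_n - mean_n are the residuals. At fixed q^2, the coefficients of a linear mean model have the usual least-squares solution, which is what makes the joint estimate analytic.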

davidwhogg commented 9 years ago

I don't think we can implement the analytic method of @tingyuansen -- that requires that the input data are heteroskedastic, which they are not. I think we should either close this or re-title it to "faster optimization of scatter estimates". I leave it up to @annayqho.

tingyuansen commented 9 years ago

Just to make sure I understand: I suppose David meant the data have to be homoskedastic? The heteroskedasticity is exactly why this analytic estimate does not work.

davidwhogg commented 9 years ago

yes, what @tingyuansen says

tingyuansen commented 9 years ago

I disagree that the implemented code is beyond this analytic formula. In `_do_one_regression`, what the code does is average the chi2 distances (over all wavelengths and spectra). This is mathematically exactly the same as the analytic formula. In fact, in the analytic formula, we only average the chi2 distance over all spectra at a fixed wavelength, i.e. s = s(lambda), a function of wavelength. Thus, I would argue that the analytic formula is actually more flexible (fewer assumptions) than the grid search.

davidwhogg commented 9 years ago

I am not sure what you mean. Let's work with math -- send a pdf file by email to all of us with the math of what you think the grid search is doing and what the analytic formula does. If what the code currently does can be solved with an analytic formula, there is a bug in the code, because what it is supposed to be doing is finding the maximum likelihood for

logL = -0.5 * sum((data - mean)**2 / (sigma**2 + s**2)) - 0.5 * sum(log(sigma**2 + s**2))

where mean has parameters, and s is a parameter. When the sigma values are different for every data point (as they are in our case), this has no analytic form, or if it does, you need to publish that asap!
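For concreteness, a minimal sketch of the grid search that this logL implies at a single wavelength (the function name and search range are illustrative, not TheCannon's actual code):

```python
import numpy as np

def scatter_grid_search(data, mean, sigma, s_grid=None):
    """Maximize the logL above over the scatter s at one wavelength.

    data, mean, sigma are 1-D arrays over objects; sigma holds the
    heteroskedastic per-object uncertainties, and s is a single scalar.
    """
    if s_grid is None:
        s_grid = np.logspace(-3, 0, 100)           # illustrative range
    var = sigma[None, :]**2 + s_grid[:, None]**2   # (n_grid, n_objects)
    logL = -0.5 * np.sum((data - mean)**2 / var + np.log(var), axis=1)
    return s_grid[np.argmax(logL)]
```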

But as I say, let's move to a pdf document, where you say what you think the do_one_regression code is doing, and what the analytic formula would do. @annayqho can you work with @tingyuansen to check that you agree that everything is the way it should be?

tingyuansen commented 9 years ago

Hi David,

For the coefficient errors (in the other post), I agree that I should write down the math more carefully in a pdf.

As for this post, I think the situation is more "motivational" than "mathematical".

I agree that the exact expression you stated above has no analytic solution. But if we substitute sigma^2 + s^2 = q^2, I think we agree that q^2 has an analytic solution (from the pdf I sent last week). The analytic formula still holds when q is a function of lambda.
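As a sketch of what that substitution buys (array shapes assumed for illustration, not TheCannon's API): with q^2 a free parameter at each wavelength, maximizing the Gaussian likelihood gives the mean squared residual over objects, with no grid search needed:

```python
import numpy as np

def analytic_q2(data, model):
    """Analytic per-wavelength estimate of q^2 = sigma^2 + s^2.

    data, model: arrays of shape (n_objects, n_wavelengths).  The
    maximum-likelihood q^2(lambda) is the mean squared residual over
    objects at each wavelength.
    """
    return np.mean((data - model)**2, axis=0)      # one value per lambda
```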

Perhaps my question is that I haven't fully grasped the physical motivation for splitting the error term into sigma^2 + s^2. In the current grid search, s is independent of both the labels and lambda. It seems to me that the q^2 formalism might be cleaner, more flexible (because it is a function of lambda), and perhaps more physically relevant -- i.e., how much intrinsic scatter there is among the labels at a fixed lambda. Furthermore, it has an analytic formula.

I must be missing something, but I couldn't figure out the advantage of using s instead of q.

Cheers, YS

davidwhogg commented 9 years ago

No, I don't agree -- sigma is an N-element vector, so sigma^2 + s^2 is an N-element vector. I don't see this as having an analytic formula. Every sigma is different -- heteroskedastic! Does that answer your question about motivation?

annayqho commented 9 years ago

@tingyuansen, I'll comment on some of your remarks and then explain the motivation for the division of terms.

Re: your remarks:

- You're right that the code steps through each potential value of the scatter and calculates the chi-squared of the resulting best-fit model. However, your line "in the current grid search, s is independent of both the labels and lambda" worries me. s is in fact dependent on lambda (or at least, it should be!), as it is solved for in each separate map() call. Look at line 168 of train_model: the map() function is called, which calls do_one_regression for each pixel separately and thus solves for a scatter value at each pixel separately (see the sketch below).
- "The analytic formula still holds when q is a function of lambda" -- but q would be a function of object AND of lambda, right? Because sigma is the observational uncertainty, which is simply the error on each pixel reported by APOGEE. So even holding lambda fixed, q varies along the object axis.
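A schematic of that per-pixel structure, for concreteness (a toy stand-in, not the actual train_model code; an inverse-variance-weighted mean replaces the real label fit):

```python
import numpy as np

def do_one_regression(args):
    flux, sigma = args                      # arrays over objects, at ONE pixel
    mean = np.average(flux, weights=1.0 / sigma**2)  # stand-in for the label fit
    s_grid = np.logspace(-3, 0, 100)
    var = sigma[None, :]**2 + s_grid[:, None]**2
    logL = -0.5 * np.sum((flux - mean)**2 / var + np.log(var), axis=1)
    return mean, s_grid[np.argmax(logL)]    # one scatter per pixel

rng = np.random.default_rng(0)
flux = rng.normal(1.0, 0.1, (50, 8))        # (n_objects, n_pixels)
sigma = rng.uniform(0.05, 0.2, (50, 8))
results = list(map(do_one_regression, zip(flux.T, sigma.T)))
scatters = [s for _, s in results]          # s = s(lambda): one value per pixel
```

Because map() runs over the pixel axis, the scatter comes out as a function of lambda, averaged over objects at each pixel.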

Motivation for the division of terms: they are conceptually different, and they differ in which quantities they depend on. Both depend on lambda, but while the scatter of the model is the same for all objects, the observational uncertainty is permitted to differ between objects. This is what distinguishes our model from machine learning, and what gives it the flexibility to accommodate, e.g., both high-SNR and low-SNR objects.

tingyuansen commented 9 years ago

Hi Anna & David,

Many thanks.

Maybe I don't understand what map() is doing. My understanding is that the blob sends all of the arrays to `_do_one_regression_at_fixed_scatter`. This function returns an array of "chi". All values in the "chi" array are then summed in line 88, `chis_eval[jj] = np.sum(chi*chi) - logdet_Cinv`. If this is the case, the chis_eval values are summed over all objects and lambda. However, if you confirm that map splits the arrays such that the sum runs over objects but not lambda, then I should rest my case :).

Assuming that this is the case, s^2 and q^2 are both functions of lambda only (averaged over all the objects). In fact, it is intuitively clear that there is no way to derive s (or q) as a function of both lambda and object -- in that case, there would be as many parameters (#lambda * #objects) as observed values (the flux values as a function of lambda and object). That's also why we couldn't generalize the analytic formula from an inverse-Gamma prior to an inverse-Wishart prior (which would take care of heteroskedasticity). Since s is only a function of lambda, I might disagree that s solves the heteroskedasticity in full.

Regarding Anna's question about q: the analytic formula gives an estimate of q only as a function of lambda (not of object). The analytic formula does not care about sigma^2; I was just trying to link my "q" to David's "s".

I agree with David that s^2 and q^2 are slightly different: s^2 takes care of the observed errors (giving less "weight" to noisy data) while estimating the bias, whereas q^2 estimates the bias as a whole, ignoring the noise.

If we want to approximate "s" with the analytic formula for "q", one can imagine giving a different weight to each spectrum (by "copying" the high-S/N spectra) and applying the analytic formula. In this case, I believe the analytic formula for q should be approximately the same as s (not exactly, because s subtracts the noise instead of down-weighting it). Could this be what David asked for (an analytic formula for s)?
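One hedged reading of that weighting trick as a sketch (inverse-variance weights stand in for "copying" high-S/N spectra; the name is illustrative):

```python
import numpy as np

def weighted_q2(data, model, sigma):
    """Noise-weighted analytic estimate of q^2 at one wavelength.

    Inverse-variance weights play the role of replicating high-S/N
    spectra.  Unlike s^2, nothing is subtracted for sigma^2, so this
    only approximates David's s.
    """
    w = 1.0 / sigma**2
    return np.sum(w * (data - model)**2) / np.sum(w)
```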

Thanks.

P.S. I agree with Hans-Walter that we should hit the pause button on this discussion.

Cheers, Yuan-Sen

davidwhogg commented 9 years ago

If there is no heteroskedastic sigma**2 in do_one_regression, then the code has a bug in it!