andycasey / trex

0 stars 0 forks source link

Distribution choices and propagation of uncertainty #1

Open dfm opened 3 years ago

dfm commented 3 years ago

Hey Andy! @quadrychance and I have been thinking about how to propagate uncertainty and I did a few experiments last night and have some thoughts/questions.

The main question is about the choice of distributions. I think that the model in one bin of color-magnitude should be something like the following:

There is some true jitter for each source that is either exactly sigma_0 (the unknown measurement uncertainty) or drawn from some distribution with a heavy tail. Then the observed jitter (or really the observed variance) will be distributed according to (ref: classical stats)

N * sigma_obs^2 / sigma_true^2 ~ chi2(N-1)

where N is the number of transits. This will be chi2 instead of normal because (I think) the measurement being made is approximately estimating a RMS error given N residuals. I was wondering if it would be possible/worth it to adjust your Stan model to include this observation model. We could still model the true distribution of jitters as a normal + log_normal, but it would add a set of N_targets parameters which would be the true jitters for each source. In principle, Stan should be able to handle such things, but I did a few very basic tests in PyMC3 and found it to be not entirely trivial.

This might be worth it because then, for each star, you would be taking into account the uncertainty introduced by the sampling for each target and it would propagate the uncertainty to the confidence with which you label each star as a single or outlier. Then, we could use the same model to determine the classification for new stars.

Have you thought about this and decided to skip it (it will certainly be more computationally expensive) or do you have other thoughts?

andycasey commented 3 years ago

I haven't thought of this! That's why they pay you the big bucks!

Would this mean that we would need to model all sources at once to properly estimate the individual source jitter? (Because right now we model some subset of sources and use that to build a GP.) If we needed to model all sources simultaneously then that model would still include some parameterisation to describe how the parameters of the normal + log_normal vary as a function of (bp_rp, absolute_g_mag, phot_g_mag), right?

(Also, sorry for the lateness on this!)