LSSTDESC / 2pt_validation

Repo to track progress on 2PT validation (LSS 1.1.X tasks)
Debug DoubleGaussian. #15

Closed slosar closed 8 years ago

slosar commented 8 years ago

DoubleGaussian photo-z doesn't quite work as anticipated. This is an extract from an email to Johann Cohen-Tanugi:

One is not completely free with functional forms for p(z). For a start, say there is a bump at z_true and a lesser bump at z_catastrophic: one cannot simply associate the same pdf with every galaxy at z_true; one also needs to associate the same pdf with the correct number of galaxies at z_catastrophic. In other words, for every galaxy that scatters from z_true to z_cat, there must be the right number of galaxies scattering from z_cat to z_true to make those probabilities meaningful. Moreover, if I take realizations of p(z_true) for a large number of galaxies around z_true and add them, I need to correctly enter the central limit theorem regime, where the total likelihood collapses into a Gaussian around z_true (most codes will, in effect, rely on this). I'm trying to write down some math to formalize these requirements.

If you do your photo-zs with a full sim, i.e. generate ugrizy from z_true, fit them and get p(z_true), then these requirements will be satisfied automatically (because ugrizy generated from z_catastrophic will presumably be very similar). But if you try to cheat and go straight from z_true -> p(z), which we are doing here, you need to be careful. I tried to do this with my DoubleGaussian photo-z toy model, see these slides:

https://docs.google.com/presentation/d/1PvDOfGqh4UT3Ulasp7KClBPzFJtF7lAeKv02I4M_igA/edit?usp=sharing

but I'm not sure if this is actually correct. In fact, one can do a couple of self-consistency tests. One, implemented in

./validate_fastcat/check_pz_sanity.py

works as follows: if p(z) indeed describes the proper, true p(z), and you take one galaxy, then the cumulative p(z) evaluated at z_true should be a random number distributed uniformly between 0 and 1. So I just calculate this for all galaxies and plot a histogram. And in fact, my DoubleGauss fails this test (but the normal Gaussian passes it). I think you should make sure your code passes this. In fact, given that you have a rather limited number of p(z) shapes (i.e. many fewer than we have galaxies), I suggest the following algorithm:

Take a random p(z) from the library and draw a z from it. Then find a galaxy at this z_true (+/- epsilon z) and associate this pdf with that particular galaxy.

This means that many galaxies will have the same p(z), but this is not an issue. The problem with the above is that towards the end, you will start running out of galaxies at the right places. Then you can associate just normal perfect Gaussians with the last 10% or whatever.
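
The assignment scheme described above could be sketched roughly as follows. This is only an illustration in plain Python, not code from the repository: the function name `assign_pdfs`, the representation of the library as sampler callables, the `eps` tolerance, the `fallback_fraction` cutoff and the toy library in the usage lines are all assumptions made for the example.

```python
import bisect
import random


def assign_pdfs(z_true, samplers, eps=0.01, fallback_fraction=0.1,
                max_draws=None, rng=None):
    """Associate library pdfs with galaxies by drawing z from the pdfs.

    z_true   : list of true redshifts, one per galaxy
    samplers : hypothetical library interface, a list of callables
               sampler(rng) -> z, each drawing one z from its p(z)
    Returns a list holding the library index assigned to each galaxy,
    or None for galaxies left over for the plain-Gaussian fallback.
    """
    rng = rng or random.Random()
    n = len(z_true)
    # Unassigned galaxies, kept sorted by z_true so we can bisect on z.
    ids = sorted(range(n), key=lambda i: z_true[i])
    zs = [z_true[i] for i in ids]
    assignment = [None] * n
    n_fallback = int(fallback_fraction * n)
    if max_draws is None:
        max_draws = 50 * n  # arbitrary cap so the loop always terminates

    for _ in range(max_draws):
        if len(zs) <= n_fallback:
            break  # leave the remainder for the Gaussian fallback
        k = rng.randrange(len(samplers))   # random p(z) from the library
        z = samplers[k](rng)               # draw a z from it
        pos = bisect.bisect_left(zs, z)
        # The nearest unassigned galaxy sits at pos - 1 or pos.
        best = None
        for j in (pos - 1, pos):
            if 0 <= j < len(zs) and abs(zs[j] - z) <= eps:
                if best is None or abs(zs[j] - z) < abs(zs[best] - z):
                    best = j
        if best is not None:
            assignment[ids.pop(best)] = k
            zs.pop(best)
    return assignment


# Toy usage: 2000 galaxies and a two-entry library (both made up).
rng = random.Random(1)
z_true = [rng.uniform(0.0, 1.5) for _ in range(2000)]
library = [
    lambda r: r.gauss(0.5, 0.1),
    lambda r: r.gauss(1.0, 0.1) if r.random() < 0.8 else r.gauss(0.3, 0.05),
]
assignment = assign_pdfs(z_true, library, rng=rng)
n_fallback = sum(a is None for a in assignment)
```

Galaxies whose entry is still `None` at the end are the ones that would get the plain Gaussian. Draws that land where no unassigned galaxy remains within `eps` are simply discarded, which is the "running out of galaxies" effect mentioned above.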

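The cumulative-p(z) histogram test described earlier can be illustrated with a small toy model. The sketch below is not the contents of check_pz_sanity.py; the helper names, the parameter values (`SIGMA`, `Z_CAT`, `SIGMA_CAT`, `F_CAT`) and the way the "naive" double Gaussian is built (a catastrophic bump added to every pdf without any galaxy actually scattering there) are assumptions chosen only to show a passing and a failing case.

```python
import math
import random


def norm_cdf(x, mu, sigma):
    """Cumulative distribution of a Gaussian N(mu, sigma) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def histogram(values, nbins=10):
    """Counts of values in [0, 1] in nbins equal-width bins."""
    counts = [0] * nbins
    for u in values:
        counts[min(int(u * nbins), nbins - 1)] += 1
    return counts


def max_deviation(counts):
    """Largest fractional departure of any bin from a flat histogram."""
    expected = sum(counts) / len(counts)
    return max(abs(c - expected) / expected for c in counts)


rng = random.Random(42)
SIGMA = 0.05                              # assumed photo-z scatter
Z_CAT, SIGMA_CAT, F_CAT = 2.0, 0.1, 0.2   # assumed catastrophic bump

cum_gauss, cum_double = [], []
for _ in range(20000):
    z_true = rng.uniform(0.5, 1.0)
    z_phot = rng.gauss(z_true, SIGMA)     # peak of this galaxy's p(z)

    # Single Gaussian p(z) = N(z_phot, SIGMA): cumulative at z_true.
    cum_gauss.append(norm_cdf(z_true, z_phot, SIGMA))

    # Naive double Gaussian: the same core plus a bump at Z_CAT, although
    # no galaxy in this toy sample ever scatters to or from Z_CAT.
    cum_double.append((1.0 - F_CAT) * norm_cdf(z_true, z_phot, SIGMA)
                      + F_CAT * norm_cdf(z_true, Z_CAT, SIGMA_CAT))

dev_gauss = max_deviation(histogram(cum_gauss))
dev_double = max_deviation(histogram(cum_double))
```

In this toy setup the single-Gaussian histogram is flat to within sampling noise, while the naive double Gaussian leaves the top bins empty, because the probability placed in the catastrophic bump is never matched by real galaxies.
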
slosar commented 8 years ago

Moved it to attic, superseded by HiddenVar