LSSTDESC / DEHSC_LSS

Scripts and workflow relevant to HSC data

Decide photo-z bins #16

Closed · damonge closed this issue 6 years ago

damonge commented 6 years ago

We need to decide what binning to use. The initial proposal is to explore the dependence of the overall S/N on the number of bins Nbin, with an equal number of galaxies in each bin, and to cut at a sensible value of Nbin beyond which no significant S/N is gained and which doesn't entail a ridiculous computational effort.

Tagging @humnaawan as the person who has taken the lead on this.

damonge commented 6 years ago

Alright @humnaawan, this is how I'd compute the S/N for a given choice of Nbin.

  1. Compute a 3D array with dimensions [N_ell,Nbin,Nbin], where N_ell is the number of multipoles ell in which you will calculate the power spectra (I'd suggest using all ells between 2 and e.g. 2000 or something like that). Let's call this array S. The sub-array S[:,i,j] will contain the signal cross-power spectrum between the i-th bin and the j-th bin.
  2. Compute a second 3D array (let's call it N) with the same dimensions containing the noise power spectrum. For a given ell, this is simply a diagonal matrix with 1/n_i in the i-th diagonal element, where n_i is the number density in the i-th bin in units of 1/sterad.
  3. Compute the total power spectrum C, given simply by C=S+N
  4. The squared signal-to-noise ratio is then simply proportional to the sum over all ells of
    (2*ell+1) Tr(S_ell . C_ell^-1 . S_ell . C_ell^-1)

    (here S_ell is the element of S for multipole ell, and similarly for C; ^-1 means matrix inversion, Tr means "trace", and . denotes matrix multiplication - dot product). So the idea is to compute the quantity above for each ell and then sum over ells to get the squared S/N.

So a plot of this quantity as a function of Nbin should tell us how many bins we want to use.
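
For concreteness, here is a minimal sketch of steps 1-4 above; this is not the notebook's actual code. It assumes modern pyccl (ccl.NumberCountsTracer / ccl.angular_cl) for the signal spectra, a unit linear bias, and hypothetical inputs z, nz_list (per-bin N(z) samples) and n_gal (per-bin number densities in 1/sterad):

```python
import numpy as np
import pyccl as ccl

def sn2_for_binning(cosmo, z, nz_list, n_gal, ells):
    """Quantity proportional to (S/N)^2 for a given choice of bins."""
    nbin = len(nz_list)

    # Step 1: signal cross-spectra S[:, i, j] between bins i and j
    # (unit linear bias assumed for illustration).
    tracers = [ccl.NumberCountsTracer(cosmo, has_rsd=False, dndz=(z, nz),
                                      bias=(z, np.ones_like(z)))
               for nz in nz_list]
    S = np.zeros((len(ells), nbin, nbin))
    for i in range(nbin):
        for j in range(i, nbin):
            S[:, i, j] = S[:, j, i] = ccl.angular_cl(
                cosmo, tracers[i], tracers[j], ells)

    # Step 2: noise spectra, diagonal with 1/n_i (n_i in 1/sterad).
    N = np.zeros_like(S)
    N[:] = np.diag(1.0 / np.asarray(n_gal))

    # Step 3: total power spectrum.
    C = S + N

    # Step 4: sum over ell of (2*ell+1) Tr(S . C^-1 . S . C^-1).
    sn2 = 0.0
    for ell, S_l, C_l in zip(ells, S, C):
        M = S_l @ np.linalg.inv(C_l)
        sn2 += (2 * ell + 1) * np.trace(M @ M)
    return sn2

# Example usage (illustrative cosmology):
# cosmo = ccl.Cosmology(Omega_c=0.27, Omega_b=0.045, h=0.67,
#                       sigma8=0.83, n_s=0.96)
# ells = np.arange(2, 2001)
```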

Hope this is not too convoluted. Let me know if you want me to clarify anything.

humnaawan commented 6 years ago

@damonge please see the notebook for an attempt to calculate the S/N for different Nbin. I was expecting the S/N to plateau rather quickly as I increase Nbin but it doesn't; the gain decreases with more bins. Here's the last output from the notebook:

[Screenshot: last output from the notebook, S/N as a function of Nbin]

I will re-check whether I am implementing the methodology correctly; I have not found any bugs so far. In the meantime, a few notes/questions for you:

egawiser commented 6 years ago

Looks pretty reasonable so far to me. A few questions:

  1. I would think that in theory we should be calculating S/N over all 7 fields as f(N_bins) and using that to optimize N_bins. It might be the case that every individual field gives the same answer, but otherwise, the global result is what we're trying to optimize. Is there an obstacle to combining fields and using the combined surface density to calculate shot noise and the combined f_sky to estimate the covariance matrix?
  2. Speaking of which, shouldn't f_sky appear somewhere in the nice step-by-step instructions for calculating S/N?
  3. I suspect that the result of those step-by-step instructions is actually (S/N)^2, because otherwise it wouldn't make sense that you can sum over \ell (or use the trace to sum down the diagonal over N_bins). I would expect the total S/N to behave like a quadrature sum of the S/N at each (\ell, m) in each bin, which would explain why S C^-1 gets repeated in the calculation: it's literally Tr((S/N)·(S/N)), with the (2\ell+1) factor handling the sum over m. (That matrix multiplication apparently handles covariances between bins, which are the off-diagonal terms in S and C.) This would imply that @humnaawan's plot above, with values close to 4E6, is really revealing S/N ~ 2000, which is still impressive but more reasonable-sounding for the significance at which we can rule out the hypothesis that galaxies are unclustered.
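
In equation form, points 2-3 amount to the following (a restatement of the expression from the first comment, with the fsky/2 prefactor that @damonge confirms in the next comment):

$$\left(\frac{S}{N}\right)^2 = \frac{f_{\rm sky}}{2}\sum_\ell (2\ell+1)\,\mathrm{Tr}\left(S_\ell C_\ell^{-1} S_\ell C_\ell^{-1}\right)$$

so a plotted value near 4E6 corresponds to S/N = sqrt(4E6) = 2000.
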
humnaawan commented 6 years ago

@egawiser there isn't any difficulty (as far as I can tell right now) in combining all 7 fields. I just wanted to confirm before I set up the code for it.

Also, the way I've implemented David's outline, f_sky only comes in when estimating the number density in step 2.

damonge commented 6 years ago

This is great @humnaawan !

  1. @egawiser is right, the expression I gave you was for something proportional to the (S/N)^2 (I've corrected this above for future reference).
  2. @egawiser is also right that the expression would be complete if you multiplied it by fsky/2. I didn't want to include that in the description above because it would have implied the extra complication of calculating fsky, and we don't really care about the absolute S/N, just about the degree of improvement as a function of Nbin.
  3. I now realize that you actually need fsky to calculate the shot noise. I would compute the total patch area (in sterad) as np.sum(maskedfraction)*np.radians(fsk.dx)*np.radians(fsk.dy). The shot-noise contribution to the power spectrum will just be this number divided by the total number of clean objects in that redshift bin (see the sketch after this list).
  4. Regardless of the above, I think the plot above makes a lot of sense, and is telling us that we should stop at Nbin=4. It would be really good to look at a plot that shows N(z) for the different bins for a few choices of Nbin. This would allow us to see the overlapping redshift distributions and see the correlations "by eye".
  5. @egawiser I would expect all fields to have the same N(z), since we are using a sufficiently conservative magnitude cut that differences in depth between fields should be minimal. It is true that different systematics in different fields may affect this statement slightly, especially when you bring photo-zs into the mix, and we should certainly compare the N(z)s we get for different fields and methods. But since the purpose of this exercise is only to inform the choice of Nbin, I would be inclined to trust any results coming from the analysis of a single, sufficiently large field.
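
A minimal sketch of point 3, using the names from that comment; masked_fraction and fsk (with pixel sizes fsk.dx, fsk.dy, assumed here to be in degrees) are pipeline objects not defined in this thread:

```python
import numpy as np

def patch_area_and_noise(masked_fraction, fsk, n_objects):
    """Patch area (sterad), shot-noise power, and fsky for one redshift bin."""
    # Total patch area in steradians, from the masked-fraction map and the
    # flat-sky pixel sizes (fsk.dx, fsk.dy assumed to be in degrees).
    area = np.sum(masked_fraction) * np.radians(fsk.dx) * np.radians(fsk.dy)
    # Shot-noise power spectrum: 1/n = area / (number of clean objects).
    noise = area / n_objects
    # fsky, should one want the fsky/2 prefactor from point 2.
    fsky = area / (4.0 * np.pi)
    return area, noise, fsky
```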

So, I would say, @humnaawan is extremely close to solving this issue once the following is addressed:

(while writing this my mousepad betrayed me and I accidentally pressed "close and comment", sorry about that!)

damonge commented 6 years ago

A few other minor comments, now that I'm looking at @humnaawan's notebook.

The notebook looks good otherwise. You're becoming a CCL expert!

humnaawan commented 6 years ago

@damonge @egawiser here's a short rundown of the changes implemented in the code since my last post, following your comments:

I have re-printed some representative results in this notebook, while all the actual output plots (and the sbatch outputs) are in /global/cscratch1/sd/awan/lsst_output/hsc_output/. Here are some highlights:

[Plot: S/N vs Nbin (up to 10 bins) for WIDE-VVDS, z_best-based binning, N(z) from z_MC]

[Plot: S/N vs Nbin (up to 10 bins) for WIDE-XMMLSS, z_best-based binning, N(z) from z_MC]

We see that the results are largely the same across the two fields, with the S/N starting to plateau for Nbin > ~7. The two algorithms (ephor_ab and franken_z) give similar results, and we see similar trends when using z_mode as the point estimator (please see Output[12] in the notebook for comparison plots).

[Plot: dN/dz for Nbin=4, WIDE-XMMLSS, ephor_ab, z_best-based binning, N(z) from z_MC]

So we have 0.15, 0.50, 0.76, 1.0, 1.5 as the bin edges for Nbin=4. We get the same bin edges from VVDS; see Output[13] in the notebook.

[Plot: dN/dz for Nbin=5, WIDE-XMMLSS, ephor_ab, z_best-based binning, N(z) from z_MC]

So we have 0.15, 0.47, 0.65, 0.86, 1.10, 1.5 as the bin edges for Nbin=5. We get the same bin edges from VVDS; see Output[14] in the notebook.

As @damonge pointed out during our discussions today, doing the analysis for more than 5 bins would be rather computationally prohibitive, so the idea is to use 4-5 bins, with bin edges from the larger fields when using ephor_ab z_best data as the point estimator. I'll create the files with these bin edges for 4 and 5 bins for cat_sampler.py (a sketch of the edge computation is below).
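
A possible sketch of how the equal-number bin edges above can be computed; z_best here is a hypothetical per-object array, and the file format expected by cat_sampler.py is not shown in this thread:

```python
import numpy as np

def equal_number_bin_edges(z_best, nbin, zmin=0.15, zmax=1.5):
    """Bin edges giving (nearly) equal galaxy counts per bin."""
    z = z_best[(z_best >= zmin) & (z_best <= zmax)]
    # Quantiles of the point-estimate redshifts split the sample evenly.
    edges = np.quantile(z, np.linspace(0.0, 1.0, nbin + 1))
    edges[0], edges[-1] = zmin, zmax  # pin the outer edges
    return edges

# For nbin=4 this should land close to the 0.15, 0.50, 0.76, 1.0, 1.5
# edges quoted above.
```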

Please let me know if there are any concerns.

damonge commented 6 years ago

Awesome @humnaawan! Definitely no need to optimize anything further. We only need this to kind of guide our choice of bins, so this is perfect.

More than 5 bins is not necessarily prohibitive, I just doubt we'll get much better results, since we'll need to include nuisance parameters for any new bin (which is something that isn't quantified here). So I'd say, let's run with 4 or 5, and we can try something else later if we get ambitious.

Closing!

egawiser commented 6 years ago

Given the issue of nuisance parameters and the modest S/N improvement above for N>4, I'd suggest N=4 as a baseline, with N=6 as a "stretch goal" to be tried if we'd like to modestly improve upon the N=4 results. I also note that N=4 would allow bin edges of 0.15, 0.50, 0.75, 1.0, 1.5 that look nicely rounded and would result in almost perfect equipartition of the number of galaxies per bin.