LSSTDESC / DEHSC_LSS

Scripts and workflow relevant to HSC data

Decide photo-z bins #16

Closed · damonge closed this issue 6 years ago

damonge commented 6 years ago

We need to decide what binning to use. The initial proposal is to explore the dependence of the overall S/N on the number of bins Nbin, with an equal number of galaxies in each bin, and to cut at a sensible value of Nbin beyond which no significant S/N is gained and which doesn't entail a ridiculous computational effort.

Tagging @humnaawan as the person who has taken the lead on this.

damonge commented 6 years ago

Alright @humnaawan, this is how I'd compute the S/N for a given choice of Nbin.

  1. Compute a 3D array with dimensions [N_ell,Nbin,Nbin], where N_ell is the number of multipoles ell in which you will calculate the power spectra (I'd suggest using all ells between 2 and e.g. 2000 or something like that). Let's call this array S. The sub-array S[:,i,j] will contain the signal cross-power spectrum between the i-th bin and the j-th bin.
  2. Compute a second 3D array (let's call it N) with the same dimensions containing the noise power spectrum. For a given ell, this is simply a diagonal matrix with 1/n_i in the i-th diagonal element, where n_i is the number density in the i-th bin in units of 1/sterad.
  3. Compute the total power spectrum C, given simply by C=S+N
  4. The squared signal-to-noise ratio is then simply proportional to the sum over all ells of
    (2*ell+1) Tr(S_ell . C_ell^-1 . S_ell . C_ell^-1)

    (here S_ell is the element of S for multipole ell, and similarly for C; ^-1 means matrix inversion, Tr means "trace", and . denotes matrix multiplication - dot product). So the idea is to compute the quantity above for each ell and then sum over ells to get the squared S/N.

So a plot of this quantity as a function of Nbin should tell us how many bins we want to use.
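
For concreteness, here is a minimal sketch of steps 1-4 above; this is not the notebook's actual code. It assumes modern pyccl (ccl.NumberCountsTracer / ccl.angular_cl) for the signal spectra, a unit linear bias, and hypothetical inputs z, nz_list (per-bin N(z) samples) and n_gal (per-bin number densities in 1/sterad):

```python
import numpy as np
import pyccl as ccl

def sn2_for_binning(cosmo, z, nz_list, n_gal, ells):
    """Quantity proportional to (S/N)^2 for a given choice of bins."""
    nbin = len(nz_list)

    # Step 1: signal cross-spectra S[:, i, j] between bins i and j
    # (unit linear bias assumed for illustration).
    tracers = [ccl.NumberCountsTracer(cosmo, has_rsd=False, dndz=(z, nz),
                                      bias=(z, np.ones_like(z)))
               for nz in nz_list]
    S = np.zeros((len(ells), nbin, nbin))
    for i in range(nbin):
        for j in range(i, nbin):
            S[:, i, j] = S[:, j, i] = ccl.angular_cl(
                cosmo, tracers[i], tracers[j], ells)

    # Step 2: noise spectra, diagonal with 1/n_i (n_i in 1/sterad).
    N = np.zeros_like(S)
    N[:] = np.diag(1.0 / np.asarray(n_gal))

    # Step 3: total power spectrum.
    C = S + N

    # Step 4: sum over ell of (2*ell+1) Tr(S . C^-1 . S . C^-1).
    sn2 = 0.0
    for ell, S_l, C_l in zip(ells, S, C):
        M = S_l @ np.linalg.inv(C_l)
        sn2 += (2 * ell + 1) * np.trace(M @ M)
    return sn2

# Example usage (illustrative cosmology):
# cosmo = ccl.Cosmology(Omega_c=0.27, Omega_b=0.045, h=0.67,
#                       sigma8=0.83, n_s=0.96)
# ells = np.arange(2, 2001)
```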

Hope this is not too convoluted. Let me know if you want me to clarify anything.

humnaawan commented 6 years ago

@damonge please see the notebook for an attempt to calculate the S/N for different Nbin. I was expecting the S/N to plateau rather quickly as I increase Nbin but it doesn't; the gain decreases with more bins. Here's the last output from the notebook:

[Screenshot: last output from the notebook, S/N as a function of Nbin]

I will re-check whether I am implementing the methodology correctly; I have not found any bugs so far. In the meantime, a few notes/questions for you:

egawiser commented 6 years ago

Looks pretty reasonable so far to me. A few questions:

  1. I would think that in theory we should be calculating S/N over all 7 fields as f(N_bins) and using that to optimize N_bins. It might be the case that every individual field gives the same answer, but otherwise, the global result is what we're trying to optimize. Is there an obstacle to combining fields and using the combined surface density to calculate shot noise and the combined f_sky to estimate the covariance matrix?
  2. Speaking of which, shouldn't f_sky appear somewhere in the nice step-by-step instructions for calculating S/N?
  3. I suspect that the result of those step-by-step instructions is actually (S/N)^2, because otherwise it wouldn't make sense that you can sum over \ell (or use the trace to sum down the diagonal over N_bins). I would expect the total S/N to behave like a quadrature sum of the S/N at each (\ell, m) in each bin, which would explain why S C^-1 gets repeated in the calculation: it's literally Tr((S/N)·(S/N)), with the (2\ell+1) factor handling the sum over m. (That matrix multiplication apparently handles covariances between bins, which are the off-diagonal terms in S and C.) This would imply that @humnaawan's plot above, with values close to 4E6, is really revealing S/N ~ 2000, which is still impressive but more reasonable-sounding for the significance at which we can rule out the hypothesis that galaxies are unclustered.
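
In equation form, points 2-3 amount to the following (a restatement of the expression from the first comment, with the fsky/2 prefactor that @damonge confirms in the next comment):

$$\left(\frac{S}{N}\right)^2 = \frac{f_{\rm sky}}{2}\sum_\ell (2\ell+1)\,\mathrm{Tr}\left(S_\ell C_\ell^{-1} S_\ell C_\ell^{-1}\right)$$

so a plotted value near 4E6 corresponds to S/N = sqrt(4E6) = 2000.
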
humnaawan commented 6 years ago

@egawiser there isn't any difficulty (as far as I can tell right now) in combining all 7 fields. I just wanted to confirm before I set up the code for it.

Also, the way I've implemented David's outline, f_sky only comes in when estimating the number density in step 2.

damonge commented 6 years ago

This is great @humnaawan !

  1. @egawiser is right, the expression I gave you was for something proportional to the (S/N)^2 (I've corrected this above for future reference).
  2. @egawiser is also right that the expression would be complete if you multiplied it by fsky/2. I didn't want to include that in the description above because it would have implied the extra complication of calculating fsky, and we don't really care about the absolute S/N, just about the degree of improvement as a function of Nbin.
  3. I now realize that you actually need fsky to calculate the shot noise. I would compute the total patch area (in sterad) as np.sum(maskedfraction)*np.radians(fsk.dx)*np.radians(fsk.dy). The shot-noise contribution to the power spectrum will just be this number divided by the total number of clean objects in that redshift bin (see the sketch after this list).
  4. Regardless of the above, I think the plot above makes a lot of sense, and is telling us that we should stop at Nbin=4. It would be really good to look at a plot that shows N(z) for the different bins for a few choices of Nbin. This would allow us to see the overlapping redshift distributions and see the correlations "by eye".
  5. @egawiser I would expect all fields to have the same N(z), since we are using a sufficiently conservative magnitude cut that differences in depth between fields should be minimal. It is true that different systematics in different fields may affect this statement slightly, especially when you bring photo-zs into the mix, and we should certainly compare the N(z)s we get for different fields and methods. But since the purpose of this exercise is only to inform the choice of Nbin, I would be inclined to trust any results coming from the analysis of a single, sufficiently large field.
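
A minimal sketch of point 3, using the names from that comment; masked_fraction and fsk (with pixel sizes fsk.dx, fsk.dy, assumed here to be in degrees) are pipeline objects not defined in this thread:

```python
import numpy as np

def patch_area_and_noise(masked_fraction, fsk, n_objects):
    """Patch area (sterad), shot-noise power, and fsky for one redshift bin."""
    # Total patch area in steradians, from the masked-fraction map and the
    # flat-sky pixel sizes (fsk.dx, fsk.dy assumed to be in degrees).
    area = np.sum(masked_fraction) * np.radians(fsk.dx) * np.radians(fsk.dy)
    # Shot-noise power spectrum: 1/n = area / (number of clean objects).
    noise = area / n_objects
    # fsky, should one want the fsky/2 prefactor from point 2.
    fsky = area / (4.0 * np.pi)
    return area, noise, fsky
```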

So, I would say, @humnaawan is extremely close to solving this issue once the following is addressed:

(while writing this my mousepad betrayed me and I accidentally pressed "close and comment", sorry about that!)

damonge commented 6 years ago

A few other minor comments, now that I'm looking at @humnaawan's notebook.

The notebook looks good otherwise. You're becoming a CCL expert!

humnaawan commented 6 years ago

@damonge @egawiser here's a short rundown of the changes implemented in the code since my last post, following your comments:

I have re-printed some representative results in this notebook, while all the actual output plots (and the sbatch outputs) are in /global/cscratch1/sd/awan/lsst_output/hsc_output/. Here are some highlights:

[Plot: S/N vs Nbin (up to 10 bins) for WIDE-VVDS, z_best-based binning, N(z) from z_MC]

[Plot: S/N vs Nbin (up to 10 bins) for WIDE-XMMLSS, z_best-based binning, N(z) from z_MC]

We see that the results are largely the same across the two fields, with the S/N starting to plateau for Nbin > ~7. The two algorithms (ephor_ab and franken_z) give similar results, and we see similar trends when using z_mode as the point estimator (please see Output[12] in the notebook for comparison plots).

[Plot: dN/dz for Nbin=4, WIDE-XMMLSS, ephor_ab, z_best-based binning, N(z) from z_MC]

So we have 0.15, 0.50, 0.76, 1.0, 1.5 as the bin edges for Nbin=4. We get the same bin edges from VVDS; see Output[13] in the notebook.

[Plot: dN/dz for Nbin=5, WIDE-XMMLSS, ephor_ab, z_best-based binning, N(z) from z_MC]

So we have 0.15, 0.47, 0.65, 0.86, 1.10, 1.5 as the bin edges for Nbin=5. We get the same bin edges from VVDS; see Output[14] in the notebook.

As @damonge pointed out during our discussions today, doing the analysis for more than 5 bins would be rather computationally prohibitive, so the idea is to use 4-5 bins, with bin edges from the larger fields when using ephor_ab z_best data as the point estimator. I'll create the files with these bin edges for 4 and 5 bins for cat_sampler.py (a sketch of the edge computation is below).
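
A possible sketch of how the equal-number bin edges above can be computed; z_best here is a hypothetical per-object array, and the file format expected by cat_sampler.py is not shown in this thread:

```python
import numpy as np

def equal_number_bin_edges(z_best, nbin, zmin=0.15, zmax=1.5):
    """Bin edges giving (nearly) equal galaxy counts per bin."""
    z = z_best[(z_best >= zmin) & (z_best <= zmax)]
    # Quantiles of the point-estimate redshifts split the sample evenly.
    edges = np.quantile(z, np.linspace(0.0, 1.0, nbin + 1))
    edges[0], edges[-1] = zmin, zmax  # pin the outer edges
    return edges

# For nbin=4 this should land close to the 0.15, 0.50, 0.76, 1.0, 1.5
# edges quoted above.
```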

Please let me know if there are any concerns.

damonge commented 6 years ago

Awesome @humnaawan! Definitely no need to optimize anything further. We only need this to kind of guide our choice of bins, so this is perfect.

More than 5 bins is not necessarily prohibitive, I just doubt we'll get much better results, since we'll need to include nuisance parameters for any new bin (which is something that isn't quantified here). So I'd say, let's run with 4 or 5, and we can try something else later if we get ambitious.

Closing!

egawiser commented 6 years ago

Given the issue of nuisance parameters and the modest S/N improvement above for N>4, I'd suggest N=4 as a baseline, with N=6 as a "stretch goal" to be tried if we'd like to modestly improve upon the N=4 results. I also note that N=4 would allow bin edges of 0.15, 0.50, 0.75, 1.0, 1.5 that look nicely rounded and would result in almost perfect equipartition of the number of galaxies per bin.