LSSTDESC / CLMM

A Python library for performing galaxy cluster mass reconstruction from weak lensing observables
BSD 3-Clause "New" or "Revised" License

equal p(z) bins for all sources in mock_data #404

Closed cristobal-sifon closed 1 year ago

cristobal-sifon commented 3 years ago

The current implementation with photoz_sigma_unscaled generates a pdf that is sampled differently for every galaxy. From _compute_photoz_pdfs:

for row in galaxy_catalog:
    pdf_range = row['pzsigma']*10.
    zmin, zmax = row['z']-pdf_range, row['z']+pdf_range
    zbins = np.arange(zmin, zmax, 0.03)

so the result is a column like:

               pzbins [121]              
-----------------------------------------
 -0.9189157769563053 .. 2.681084223043698
-0.8755947879144116 .. 2.7244052120855917
 -0.9013367845355461 .. 2.698663215464457
 -1.0683426683023802 .. 2.531657331697623
 -1.0260346754041412 .. 2.573965324595862
-0.8371474311387707 .. 2.7628525688612324
 -0.7837649451484183 .. 2.816235054851585
 -1.0040270867734482 .. 2.595972913226555
  -1.1684000482540333 .. 2.43159995174597
-0.7198019908543756 .. 2.8801980091456274
                                      ...
-1.3521789249221339 .. 2.2478210750778693
-0.9578447603266889 .. 2.6421552396733143
 -0.9215149025501451 .. 2.678485097449858
-0.9071426552874368 .. 2.6928573447125665
-1.0662076783207386 .. 2.5337923216792646
-1.2636985872843276 .. 2.3363014127156756
 -1.136257420102231 .. 2.4637425798977723
 -0.7774128187340363 .. 2.822587181265967
-1.0730126785251919 .. 2.5269873214748113
-0.9649808022321504 .. 2.6350191977678525
-1.3844228023509195 .. 2.2155771976490835

This is both very inconvenient, since the user then has to interpolate if they want to combine the pdfs, and a big waste of space.

I propose to change this to a single binning scheme that depends on (zsrc_min, zsrc_max) and a new optional argument zsrc_bin_width or similar, so that the binning can be stored as a separate attribute of GalaxyCluster, e.g., zsrc_pdf_centers or just zsrc_bins. See the related suggestion in #401.
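A minimal sketch of that single-binning idea (the `make_shared_zbins` helper and its `pad` argument are hypothetical illustrations, not existing CLMM API):

```python
import numpy as np

def make_shared_zbins(zsrc_min, zsrc_max, zsrc_bin_width=0.03, pad=0.0):
    """Build one redshift grid shared by all source galaxies.

    The grid would be stored once on the cluster object (e.g. as a
    `zsrc_pdf_centers`-style attribute) instead of per galaxy.
    `pad` optionally extends the range so that pdf tails are covered.
    """
    return np.arange(zsrc_min - pad,
                     zsrc_max + pad + zsrc_bin_width,
                     zsrc_bin_width)

zbins = make_shared_zbins(0.0, 3.0, zsrc_bin_width=0.03)
# Every galaxy's pdf is then sampled on this same grid, so pdfs can be
# stacked or combined without any per-galaxy interpolation.
```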

johannct commented 3 years ago

If I understand correctly, the offending lines are https://github.com/LSSTDESC/CLMM/blob/b94f88bd43f490837a204a26a00bd60b37ee4a60/clmm/support/mock_data.py#L380 and subsequent. Certainly one binning per source galaxy is overkill in this context. Another way to proceed is to avoid the for loop: take the min and max of galaxy_catalog['z'] itself, extend these zmin and zmax by pdf_range, and proceed from there. The resulting binning will be unique and applicable to all the galaxies in the catalog. (In general I think this for loop should disappear, but that will demand a bit of careful broadcasting.)

cristobal-sifon commented 3 years ago

Yes, we can replace that for loop with the following (which I haven't tested):

zbins = np.arange(zsrc_min, zsrc_max + zsrc_bin_width, zsrc_bin_width)
pzpdf_grid = np.exp(-(zbins[:, None] - galaxy_catalog['z'])**2 / (2 * pzsigma**2)) \
    / ((2 * np.pi)**0.5 * pzsigma)

or maybe the transpose of that. (And by the way, pzsigma does not need to be a column in galaxy_catalog.) We do need to be careful that this does not start to produce memory issues if the arrays are too large, so perhaps it could be done in a loop over chunks of, say, 100 or 1000 galaxies.
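Along those lines, a chunked version could look like the sketch below (`compute_pz_pdfs_chunked` is a hypothetical helper, not CLMM code, and it assumes a scalar `pzsigma` as suggested above):

```python
import numpy as np

def compute_pz_pdfs_chunked(z_true, pzsigma, zbins, chunk_size=1000):
    """Evaluate Gaussian photo-z pdfs on a shared grid, looping over
    chunks of galaxies to bound the size of the broadcast temporaries.

    Returns an array of shape (n_gal, n_bins); the full result is still
    allocated, only the intermediate (chunk, n_bins) arrays are chunked.
    """
    n_gal = len(z_true)
    out = np.empty((n_gal, len(zbins)))
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * pzsigma)
    for start in range(0, n_gal, chunk_size):
        sl = slice(start, start + chunk_size)
        # broadcast (chunk, 1) against (n_bins,) -> (chunk, n_bins)
        dz = z_true[sl, None] - zbins[None, :]
        out[sl] = norm * np.exp(-0.5 * (dz / pzsigma) ** 2)
    return out
```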

m-aguena commented 3 years ago

As I mentioned here, this part of the code is not really being used for now. @combet, do you remember who added that? I am not sure if there is a specific reason to have a different binning for each galaxy or if it was a naive implementation.

johannct commented 3 years ago

git blame says Matthew Kirby, 2 years ago, for the for loop, and then some refactoring 8 months ago by Céline, Alex, and a couple of others. There are many ways to improve the code here, but there is not much point in doing so before it is really used. And I do not think that "really using it" should wait for qp to enter the game. Quite the contrary, actually, IMHO.

combet commented 3 years ago

I don't recall why we implemented it this way. I guess one question is what will happen in real life: will photo-z codes provide pdfs on a single binning for all galaxies or not? (It's not obvious to me.) If a single binning is the standard, then I completely agree that this should be changed to something like what @cristobal-sifon proposes.

johannct commented 3 years ago

I think that the code here should not care. You should be able to mock pdfs, ingest them in some TBD form by something like qp, and make sure that the code downstream does not depend on the way the pdf is represented. This is the whole point of qp. So I do not think that it matters what real life will be: here we need to abstract away from it. The only real question to my mind is how the pdf is provided, which is a different question: e.g., inside an object catalog, or separately with some id to join to an object catalog?

m-aguena commented 3 years ago

Given this week's tag-up discussion, we decided to take a more abstract approach to storing the PDF in galcat. The idea is to have the PDF column as an object type and add a pdf attribute to the galaxy catalog to describe the use of this column. So we can have something like:

Right now we can start implementing this first case with a simple modification of what is in branch issue/404/galcat_pdf_improvement.

eacharles commented 3 years ago

FWIW, this is really pretty heavily overlapping with functionality in qp, and in particular the construction of the qp.Ensemble, which represents a set of pdfs. I.e.,

qp.Ensemble(qp.interp, data=dict(xvals=xvals, yvals=yvals))  # xvals is an array of nPoints; yvals is an array of (nPoints, nPdfs)
qp.Ensemble(qp.quant, data=dict(quants=quants, locs=locs))   # quants are the quantiles (e.g., 0.68, 0.95, 0.98); locs are the corresponding locations

It would be good to chat a bit about this.

marina-ricci commented 2 years ago

This is linked to issue #401 (interfacing CLMM and qp).