Closed cristobal-sifon closed 1 year ago
If I understand well, the incriminated lines are https://github.com/LSSTDESC/CLMM/blob/b94f88bd43f490837a204a26a00bd60b37ee4a60/clmm/support/mock_data.py#L380 and subsequent. Certainly one binning per source galaxy in this context is overkill. Another possible way to proceed is to avoid the for loop and get the min and max of the galaxy_catalog['z']
itself, and then extend these zmin and zmax by pdf_range
, and proceed. The resulting binning will be unique and applicable to all the galaxies in the catalog (in general I think that this for loop should disappear, but that will demand a bit of careful broadcasting).
Yes, we can replace that for loop by the following (which I haven't tested):
zbins = np.arange(zsrc_min, zsrc_max+zsrc_bin_width, zsrc_bin_width)
pzpdf_grid = np.exp(-(zbins[:,None]-galaxy_catalog['z'])/(2*pzsigma**2) / ((2*np.pi)**0.5*pzsigma)
or maybe the transpose of that. (And by the way, pzsigma
does not need to be a column in galaxy_catalog
.) We do need to be careful that this does not start to produce memory issues if the arrays are too large. So perhaps it could be done within a loop but in chunks of say 100 or 1000
As I mentioned here, this part of the code is not really being used for now. @combet do you remeber who added that? I am not sure if there is a specific reason to have different binnings for each galaxy or if it was a naive implementation.
git blame say Matthew Kirby 2 years ago, for the for loop, and then some refactoring 8 months ago by Céline, Alex, and a couple others. There are many ways to improve the code here, but there is not much point in doing so before it is really used. And I do not think that "really using it" should wait for qp to enter the game. Quite the contrary actually IMHO.....
I don't recall why we implemented it this way. I guess one question is what will happen in real life? Will photoz codes provide pdf on a single binning for all galaxies or not (it's not obvious to me). If this is the standard then I completely agree that this should be changed to something like @cristobal-sifon proposes.
I think that the code here should not care. You should be able to mock pdfs, ingest them in some TBD form by something like qp, and make sure that the code downstream does not depend on the way the pdf is represented. This is the whole point of qp. So I do not think that it matters what real life will be : here we need to abstract away from it. The only real question to my mind is how it is provided, which is a different question : e.g. inside an object catalog, or separately with some id to join to an object catalog?
Given this week's tag up discussion, we decided to have a more abstract approach for storing the PDF in galcat. The idea is to have the PDF column as an object type and add a pdf attribute to the galaxy catalog to describe the use of this column. So we can have something like:
pdf_description={'type': 'shared_bins', 'info': bins}
-> pdf_column
: array of size=n binspdf_description={'type': 'unique_bins'}
-> pdf_column
: tuple with (pdf bins, pdf vals)pdf_description={'type': 'quantiles', 'info': (0.68, 0.95, 0.98)}
-> pdf_column
: tuple with quantiles (68%, 95%, 99%)right now we can start implementing this first case with simple modification of what is in branch issue/404/galcat_pdf_improvement
FWIW, this is really pretty heavily overlapping with functionality in qp and in particular the construction of the qp.Ensemble, which represents a set of pdfs. I.e.,
qp.Ensemble(qp.interp, data=dict(xvals=yvals, yvals=yvals)) # where xvals is an array of nPoints, yVals is an array of (nPoint, nPdfs)
qp.Ensemble(qp.quant, data=dict(quants=quants, locs=locs)) # where quants are the quants (i.e., 0.68, 0.95, 0.98) and locs are the corresponding locations
It would be good to chat a bit about this.
This is linked to issue #401 (interfacing 'clmm
and qp
)
The current implementation with
photoz_sigma_unscaled
generates a pdf that is sampled differently for every galaxy. From_compute_photoz_pdfs
:so the result is a column like:
This is both very inconvenient, because the user then has to interpolate if they want to combine the pdfs, and a big waste of space.
I propose to change this for a single binning scheme that depends on
(zsrc_min,zsrc_max)
and a new optional argumentzsrc_bin_width
or similar, so that the binning can be stored as a separate attribute ofGalaxyCluster
, e.g.,zsrc_pdf_centers
or justzsrc_bins
. See related suggestion in #401.