LSSTDESC / CLMM

A Python library for performing galaxy cluster mass reconstruction from weak lensing observables
BSD 3-Clause "New" or "Revised" License

Add p(z) to GalaxyCluster object #401

Open · aimalz opened this issue 3 years ago

aimalz commented 3 years ago

This issue is for integrating qp with CLMM, which would enable #126 and #396. The most basic version of this would involve a new GalaxyCluster.pz attribute that is a qp.Ensemble object (with one element/row/galaxy, in this case) with its parameters stored in GalaxyCluster.galcat. Though qp already has its own data format, it's essentially an Astropy table, just like the clmm.GCData object, so it should be possible to do this efficiently, particularly with help from @eacharles.
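For concreteness, here is a minimal sketch of what that could look like, assuming qp's interpolated-grid parameterization (the constructor and method names follow qp's demo usage and should be checked against the qp documentation; the data here is toy input standing in for galcat contents):

```python
import numpy as np
import qp

# Toy stand-in for galcat contents: a shared redshift grid and one p(z)
# evaluation per galaxy (5 galaxies here; one row per distribution).
zgrid = np.linspace(0.0, 3.0, 301)
means = np.linspace(0.3, 1.5, 5)[:, None]
pdfs = np.exp(-0.5 * ((zgrid - means) / 0.1) ** 2)
pdfs /= np.trapz(pdfs, zgrid, axis=1)[:, None]    # normalize each pdf

# One Ensemble holding all galaxies' pdfs at once.
ens = qp.Ensemble(qp.interp, data=dict(xvals=zgrid, yvals=pdfs))
print(ens.pdf(0.8))                               # evaluate every pdf at z = 0.8
```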

cristobal-sifon commented 3 years ago

I can have a go at this, but I wonder why we would make it a new attribute of GalaxyCluster when no other individual column of galcat is its own GalaxyCluster attribute. I'll follow along if that is indeed the preferred implementation; I just want to make sure.

combet commented 3 years ago

I'm not familiar with qp so I might be missing something, but like @cristobal-sifon I also wonder why we'd need a new attribute. When we generate mock data with photoz error, the redshift pdf (bins and values) is stored in the galcat table and I had assumed we would still keep it in the galcat table, even with a qp implementation.

m-aguena commented 3 years ago

I agree with @cristobal-sifon and @combet that we should keep it as simple as possible. I believe we could have some wrapper functions that just pass the necessary info from GalaxyCluster.galcat to qp objects and use qp functions. Unless creating the qp.Ensemble object is a heavy/slow process (and it does not seem to be, as @aimalz pointed out that they are Astropy-table-like objects), I would just do it inside these wrapper functions.
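As a sketch of what such a wrapper could look like (the pzbins/pzpdf column names and the qp constructor are assumptions for illustration, not existing CLMM API):

```python
import numpy as np
import qp

def galcat_to_qp(galcat, bins_col="pzbins", pdf_col="pzpdf"):
    """Hypothetical wrapper: build a qp.Ensemble on the fly from galcat
    columns, assuming all galaxies share the same redshift bins."""
    xvals = np.asarray(galcat[bins_col][0])   # shared bins, shape (Nbins,)
    yvals = np.asarray(galcat[pdf_col])       # pdf values, shape (Ngals, Nbins)
    return qp.Ensemble(qp.interp, data=dict(xvals=xvals, yvals=yvals))
```

Any qp operation could then run on the returned ensemble and, if needed, the results could be written back to galcat, keeping the table as the single source of truth.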

johannct commented 3 years ago

I'd be happy to assign this to myself, as an attempt to finally do something in CLMM land..... If we want to keep the qp philosophy, it would seem to me that its representation of the pdf should be a separate table, either keeping the same indexing order or adding an id field for joins. I will try to look at the CLMM code to understand better what is at stake. But I have a conceptual difficulty with adding a qp singleton to the galcat table for each entry..... (I cannot assign this to myself, probably because I am not in the CLMM repo team.)

combet commented 3 years ago

Thanks @johannct! IMO, the galcat table could/should remain the user's entry point to the data. Very naively, and maybe too inefficiently: could we not copy the pdz field of the galcat table into a qp table, do what we need to do with it, and then store the results back in the galcat table, so that it all remains transparent to the user?

johannct commented 3 years ago

I'll have to start looking at the code to judge, but we do not want to turn the code into something hard to understand and maintain.

cristobal-sifon commented 3 years ago

Following #404, I would like to suggest here that, while the p(z) should be just another column in galcat (as I already suggested), there could be a separate GalaxyCluster attribute zsrc_bin_width or similar, instead of a pzbins column (as it is currently called when adding z errors in mock_data.generate_galaxy_catalog). I don't think this is too inconvenient a requirement for the user (i.e., that there be a single binning scheme for all source galaxy redshift pdfs), and I bet everyone has z pdfs this way anyway.
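To illustrate the suggestion (all names here are hypothetical, not existing CLMM attributes):

```python
import numpy as np

# One shared binning scheme for every source pdf, stored once as a scalar
# bin width instead of a per-galaxy pzbins column.
zsrc_bin_width = 0.01
zgrid = np.arange(0.0, 3.0 + zsrc_bin_width, zsrc_bin_width)

# galcat would then only need a pdf-values column, one row per galaxy.
pzpdf = np.random.rand(100, zgrid.size)
pzpdf /= np.trapz(pzpdf, zgrid, axis=1)[:, None]   # normalize each pdf
```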

johannct commented 3 years ago

There is something that puzzles me: I see indeed that mock_data.py mocks some pdfs at some point. But I do not see anywhere that GalaxyCluster.py and gcdata have an API entry for redshift pdfs. Shouldn't they?

m-aguena commented 3 years ago

The current state of CLMM (the version prepared for the paper) does not support pdfs. Yes, the plan is to have an API entry for pdfs in GalaxyCluster.py and gcdata but, as you can see, we are still figuring out the best way to interact with the pdfs. The reason mock_data.py does have some pdfs is that it was initially just a support piece of code for the notebooks that ended up being added to the main code.

That being said, we are still flexible on how to store pdfs in the gcdata object and can use this opportunity to define how we will add it. @johannct, how is the photo-z info of a galaxy sample passed to qp? Is it one unified array with the redshift bins and a table with the pdfs?

johannct commented 3 years ago

So either the pdfs are already saved in a format that qp can resolve, or qp needs a helper function to know how to interpret what it is provided (for instance samples drawn from the pdf, x-y pairs of evaluations of the pdf, quantiles, sparse indices, etc.). In either case, one ends up with a qp ensemble object. One can convert this back into a representation suited to an application, but the longer the pdfs stay as a qp ensemble, the better it would likely be. Again, I would like to see how the core CLMM modules would interact with the pdfs before trying to bring qp in. We do not have to get the API right on the first shot; it is ok to try and scrap some initial attempts in order to learn the best way to proceed. So I have some questions:

I can propose a thought exercise. No guarantee at all that this is what will materialize officially, I am just inventing it: the cluster-finding algorithm run by the (TBD) DESC group in charge of producing added-value catalogs for the collaboration will persist its output as a fits file with at least 3 extensions: one with all the clusters found, including perhaps a z point estimate as a quick helper; a second with information about all the galaxies that are members of a cluster present in the first extension; and a third with the pdfs of all the clusters present in extension 1 (I presume getting the pdfs of all the galaxies in extension 2 will then require a join with the object catalog). If this is so, CLMM will be strongly constrained in how it reads this file and builds its two main class instances.

A lot of people here and elsewhere in DESC know much better than me what has been done in other surveys, like DES. How is a cluster catalog persisted and read back? Is CLMM able, provided a simple reader interface dedicated to it is written, to read a DES cluster catalog, for instance? IMHO all this is preliminary to looking for the API to qp itself, the reason in my mind being that qp is not only a representation interface but also a storage library for 1-dimensional pdfs (and all these again are my thoughts only, not an official statement from PZWG :) )
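As a sketch of how such a file could be read (the layout and extension order are the hypothetical ones from the thought exercise above, not an agreed DESC format):

```python
from astropy.io import fits
from astropy.table import Table

# Hypothetical 3-extension cluster file, following the thought exercise:
# ext 1: cluster catalog, ext 2: member galaxies, ext 3: cluster p(z)s.
with fits.open("cluster_catalog.fits") as hdul:
    clusters = Table(hdul[1].data)   # one row per cluster, incl. z point estimate
    members = Table(hdul[2].data)    # member galaxies, keyed to ext-1 clusters
    cluster_pdfs = hdul[3].data      # p(z) of each cluster in ext 1
```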

m-aguena commented 3 years ago

Current cluster analyses use a point estimate for the cluster redshift, as its uncertainties are much smaller than those of galaxy photo-zs. This is a result of "combining" multiple galaxy redshifts to compose the cluster redshift, and/or using the red sequence to estimate the redshift, and/or using only the brightest galaxies, which have better photo-zs.

To answer your questions:

For DES, the optical cluster finders provide separate fits files for clusters, members and maps. No cluster pdfs are provided, but they could be constructed from the member catalogs. I understand your proposed exercise, but I think a discussion about using cluster pdfs should happen before going through all the work of implementing it.

aimalz commented 3 years ago

(For completeness, I'm looping in @sschmidt23 as well as @eacharles.)

johannct commented 3 years ago

@m-aguena oh sure, deciding on using pdf at all should come first! :) and it is largely independent of qp

combet commented 3 years ago

Hi, I'm catching up on the discussion. Very concretely, the first place we need to deal with the background-galaxy photo-z pdfs is in efficiently computing the effective inverse critical density for each galaxy, ⟨Σ_crit⁻¹⟩ = ∫ p(z) Σ_crit⁻¹(z_l, z) dz, as given in #396. Naively looping over all the background galaxies to compute the integral for each one would take too long, especially when considering a stacked analysis of a few thousand clusters. We need a way to speed up/vectorize that calculation. So what's not clear to me is whether qp provides such functionality. Are there other options to do so?

Now, regarding the pdf of the cluster redshift, I don't have a use case example where it would be necessary to have this implemented at this stage, but maybe others do?

cristobal-sifon commented 3 years ago

@combet scipy.integrate.trapz should do the trick right? or am I missing something?

combet commented 3 years ago

Oh yes, thanks for pointing this out @cristobal-sifon. If Nbins is the same for each galaxy, I understand the x and y arguments of scipy.integrate.trapz can be 2d arrays of size Nbins x Ngals so that the integration can be automatically performed for each galaxy. But I'm also wondering whether the number of bins over which the photoz pdf is sampled may differ from one galaxy to the next and/or whether qp could provide more flexibility and/or an even faster option.

Maybe a good first step would be to test the scipy.integrate functions (trapz or simps) on mock data and do some timing/precision evaluation?
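A minimal version of that test could look like this (all names are illustrative; f is a placeholder standing in for Σ_crit⁻¹(z_l, z) from #396):

```python
import numpy as np
from scipy.integrate import simps

# Mock pdfs: one shared z grid, one normalized p(z) per galaxy.
rng = np.random.default_rng(0)
ngals, nbins = 10000, 301
zgrid = np.linspace(0.0, 3.0, nbins)
pdfs = rng.random((ngals, nbins))
pdfs /= np.trapz(pdfs, zgrid, axis=1)[:, None]

f = zgrid / (1.0 + zgrid)                      # placeholder integrand
res_trapz = np.trapz(pdfs * f, zgrid, axis=1)  # all galaxies in one call
res_simps = simps(pdfs * f, zgrid, axis=1)     # Simpson cross-check
print(np.abs(res_trapz - res_simps).max())     # precision difference
```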

cristobal-sifon commented 3 years ago

As I pointed out in #404, I don't see any justification for having different z bins in what is, in any case, a set of poorly sampled probability distributions.

trapz can take numpy arrays of any shape, and it is pretty precise as long as the integral doesn't diverge. The integration variable can be 1d or nd, as long as it is consistent with the integrand. simps works the same way; it's slightly more precise but somewhat slower. But I guess it won't hurt to test explicitly.

cristobal-sifon commented 3 years ago

> Now, regarding the pdf of the cluster redshift, I don't have a use case example where it would be necessary to have this implemented at this stage, but maybe others do?

I agree, this is usually known with good enough precision, and the uncertainty on the source redshift distribution is by and large the dominant uncertainty. So I also suggest ignoring it for now at least.

m-aguena commented 2 years ago

I created a branch related to this issue (issue/401/use_qp) to start integrating qp with CLMM. The notebook test_pz_integral.ipynb shows the comparison of the p(z) integral between CLMM's internal method and qp. It produced this plot: [plot: CLMM internal p(z) integral vs. qp]

m-aguena commented 2 years ago

Reminder:

m-aguena commented 2 years ago

Actually, it seems there was a problem with equalizing the bins in CLMM. If pdfs with the same bins are provided, the comparison with qp is much closer: [plot: comparison with matched bins]

Note: the difference comes from the truncated, non-normalized pdfs (qp is probably in the right here): [plot: effect of truncated, non-normalized pdfs]