LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

True photo-z PDFs #18

Closed aimalz closed 4 years ago

aimalz commented 6 years ago

[Sorry if this isn't the right place or time for this issue.]

In preparing the PZ DC1 paper, we identified a need for "true" redshift PDFs, which must correspond to a forward model for galaxy photometry. tl;dr Can the galaxy photometry of the DC2 catalog be determined such that the PZ WG has access to true redshift PDFs?

To produce true redshift PDFs, there would have to be an explicitly defined probability distribution in the space of redshift and "observed" photometric magnitudes/colors and their errors. This could be the result of a marginalization over other galaxy parameters like galaxy type (be it SED type or central vs. satellite, etc.) that have different distributions in the space of magnitudes/colors, photometric errors, and redshift, so long as the distributions for each type are weighted in accordance with their actual redshift-dependent proportions in the dataset. Each galaxy would have to have final catalog photometry drawn from the probability distribution of magnitudes/colors evaluated at its true redshift (i.e. the likelihood), and the redshift PDF would be the probability distribution of the redshift evaluated at its assigned magnitudes/colors (i.e. the posterior). If there were enough galaxies in the catalog to be dense in the space of redshift and magnitudes/colors, we could construct this after the DC2 catalog is produced, but we do not expect this to be the case based on the DC1 catalog.
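
As a compact restatement of the paragraph above (the notation is assumed here, not taken from any DC2 document): write $z$ for redshift, $d$ for the observed magnitudes/colors with their errors, and $t$ for the galaxy type being marginalized over. The likelihood and posterior described above would then be related by:

```latex
p(d \mid z) = \sum_{t} p(d \mid z, t)\, p(t \mid z),
\qquad
p(z \mid d) = \frac{p(d \mid z)\, p(z)}{\int p(d \mid z')\, p(z')\, dz'}
```

Here $p(t \mid z)$ plays the role of the redshift-dependent type proportions, and the "true" redshift PDF of a galaxy is $p(z \mid d)$ evaluated at that galaxy's assigned photometry.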

Is this scheme compatible with the way photometry will be assigned to galaxies in DC2? If not, would it be possible to adjust the procedure to permit true redshift PDFs to be created?

My understanding based on the document is that this could probably be done in the space of "true" photometry but perhaps not in the space of "observed" photometry, as the latter is generated after the former has been set. (Also, Fig. 1 indicates PZ gets a catalog untouched by the SSim WG -- is this intentional?) Another way to make the redshift PDFs could be to artificially overpopulate a photometric catalog with many more galaxies than will actually be included, so the space of true redshifts and observed magnitudes/colors and their errors can be densely filled with representative samples and reasonably interpolated. Given the high dimensionality of the probability space, however, that may be computationally prohibitive.

aimalz commented 6 years ago

I just saw an email (about the 16 November SSim telecon) that hinted at a generative model for galaxy morphology, although it wasn't totally clear if it would be part of DC2. Since the topic is on people's minds, I just wanted to bump up this issue on the forward modeling radar, too.

cwwalter commented 6 years ago

Hi Alex, GANs won't be ready for DC2 but we will probably apply knots following a random walk in the imSim simulation with parameters being fit to the COSMOS sample. Look at Francois' presentation at the SSim meeting two weeks ago for details.

We envision this being the 1st of several morphology choices for the future. Does that help?

salmanhabib commented 6 years ago

To me GANs constitute a separate R&D project. After appropriate V&V, we should certainly consider these techniques (among others, BTW), but a certain science threshold has to be passed before we consider them to be production.

aimalz commented 6 years ago

@cwwalter Thanks, I think I now see why it would be quite challenging to populate the space of redshifts and observed fluxes/magnitudes/colors under the post-DC2 GAN plans using a forward model, due to requiring an additional and possibly computationally prohibitive step in the pipeline from true redshifts to observed photometry. In the current (proto-DC2) framework without knots, how hard would it be to simulate observed photometry for enough galaxies to populate that space empirically? (I would expect the necessary number of galaxies to be higher than the number that would otherwise be produced for the DC2 catalog but don't have a concrete number at this time.)

cwwalter commented 6 years ago

@aimalz This is probably because I am not an expert, but I am having a hard time understanding/answering the question.

For the simulation we actually know the true redshift for each galaxy. We even know its Hubble redshift and peculiar motion separately. So, the true PDF of the redshift for an individual galaxy is a delta function. So you don't need to "produce true redshift PDFs"; you already know them.

Are you instead asking about making a posterior PDF that you hope correctly encompasses the true PDF?

aimalz commented 6 years ago

@cwwalter Ah, yes, there's a terrible problem with the language of probabilistic photo-zs! I'm talking about posteriors conditioned on observed photometry. The delta function is the probability of the redshift given the true redshift, but we'll need to do tests when the true redshift is not given and the final photometry is. We could then compare that to the estimates of the posterior of redshift given photometry (plus the assumptions of the estimation method) in the PZ DC2 paper. To have that, however, the redshift and photometry would actually have to have been drawn from a joint probability distribution defined over redshift and photometry. The true posterior would be the evaluation of that space at the drawn observed photometry but not the drawn true redshift.
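
To make the distinction concrete, here is a minimal sketch under stated assumptions. It is a toy and not the DC2 pipeline: the single "color", the Gaussian redshift prior, the linear color-redshift relation, and every constant below are hypothetical. A galaxy is drawn from an explicit joint distribution, and the true posterior is then evaluated using only the drawn color, never the drawn redshift.

```python
import math
import random

# Hypothetical toy joint distribution (all numbers are made up).
PRIOR_MEAN, PRIOR_STD = 1.0, 0.2   # assumed p(z)
SLOPE, NOISE_STD = 0.8, 0.1        # assumed p(color | z) = N(SLOPE * z, NOISE_STD)


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def draw_galaxy(rng):
    """Draw (true redshift, observed color) from the joint distribution."""
    z_true = rng.gauss(PRIOR_MEAN, PRIOR_STD)
    color_obs = rng.gauss(SLOPE * z_true, NOISE_STD)
    return z_true, color_obs


def true_posterior(color_obs, z_grid):
    """Evaluate p(z | color_obs) on a uniform grid; z_true is never used."""
    joint = [
        normal_pdf(z, PRIOR_MEAN, PRIOR_STD) * normal_pdf(color_obs, SLOPE * z, NOISE_STD)
        for z in z_grid
    ]
    dz = z_grid[1] - z_grid[0]
    norm = sum(joint) * dz
    return [p / norm for p in joint]


rng = random.Random(0)
z_grid = [i * 0.005 for i in range(501)]  # uniform grid from 0 to 2.5
z_true, color_obs = draw_galaxy(rng)
posterior = true_posterior(color_obs, z_grid)
posterior_mean = sum(z * p for z, p in zip(z_grid, posterior)) * (z_grid[1] - z_grid[0])
```

In this toy the posterior has finite width even though `z_true` is a single number, which is the sense in which the "true PDF" here is not a delta function.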

cwwalter commented 6 years ago

The true posterior would be the evaluation of that space at the drawn observed photometry but not the drawn true redshift.

OK, got it thanks!

So, my next basic stumbling block: :)

We don't "assign photometry" in the simulations. We simulate the galaxies with a SED as assigned by the galaxy inpainting. We use that SED to simulate the light arriving through the atmosphere and at the telescope and then we run DM on it. Currently we run a few photometry algorithms including cmodel.

You can take a look here:

https://github.com/LSSTDESC/SSim_DC1/blob/master/Notebooks/Dask/DC1%20Dask%20Access.ipynb

to get an idea of what is available.

The results of those algorithms are our photometry measurements. So, I'm not positive what you are asking. Are you asking if there is a way for us to make sure we sample the true space evenly?

aimalz commented 6 years ago

We don't "assign photometry" in the simulations. . .

This is why I think it could be quite challenging to make the probability space over redshift and photometry. It sounds like you have some probability distribution over redshift and SED that you use for assigning SEDs to galaxies once their redshifts are set. To have a probability distribution over photometry and redshift, there would have to be an explicit way to convert from an SED and redshift to photometry. I could imagine doing this with some kind of empirical diffusion map (take points in the redshift-SED space and do the simulation involving the atmosphere and telescope for those points and see where they end up in redshift-photometry space) if it can't be done analytically/algorithmically.

There's no requirement of uniform sampling in the redshift-SED nor redshift-photometry spaces, but it would be useful to have the space of redshift and photometry defined such that we can reasonably evaluate posteriors in that space. So, there would have to be sufficient support but not necessarily even coverage. In the empirical case, that could mean running the stochastic part of the simulation involving the atmosphere and telescope multiple times for areas of the redshift-SED space that are sparsely sampled by the drawn galaxies so the redshift-photometry space has enough support.

cwwalter commented 6 years ago

This is why I think it could be quite challenging to make the probability space over redshift and photometry. It sounds like you have some probability distribution over redshift and SED that you use for assigning SEDs to galaxies once their redshifts are set. To have a probability distribution over photometry and redshift, there would have to be an explicit way to convert from an SED and redshift to photometry. I could imagine doing this with some kind of empirical diffusion map (take points in the redshift-SED space and do the simulation involving the atmosphere and telescope for those points and see where they end up in redshift-photometry space) if it can't be done analytically/algorithmically.

Right. I think this is exactly what we are doing. Just imagine you are doing a real experiment with the telescope but you know the truth information for every object you measure. So, we are sampling the sky from the N-body simulation. The dark matter is painted in with galaxies whose distributions and SEDs come from a SAM. The light from the galaxies is propagated down to the sensor. Then we run the stack on the images and we get photometry in 6 bands for each galaxy we detect. But we also know the true redshift for that object. We can do catalog matching against our input catalog to make the correspondence.

So now you can populate a distribution in photometry (in as many bands as you want) vs true redshift using each measured galaxy as a point. I think you could use something like a GP to interpolate over that space.

I think everything else has been marginalized over already. If you pick a set of photometries you can read off the redshift PDF.

Is this what you want? Or do you need something more? Or am I missing something basic?
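
The "populate the space, then read off the redshift PDF" idea can be sketched as follows. This is a rough illustration only: it swaps in a simple Gaussian kernel density estimate for the GP mentioned above, uses one color instead of six bands, and runs on a synthetic stand-in catalog. The function name, bandwidths, and all numbers are hypothetical.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def conditional_pz(catalog, color_query, z_grid, h_color, h_z):
    """Kernel estimate of p(z | color) from (z_true, color_obs) pairs.

    Each catalog entry is weighted by how close its observed color is to
    color_query, then smoothed along redshift with bandwidth h_z.
    """
    weights = [gaussian_kernel((color_query - c) / h_color) for _, c in catalog]
    density = [
        sum(w * gaussian_kernel((z - zi) / h_z) for w, (zi, _) in zip(weights, catalog))
        for z in z_grid
    ]
    dz = z_grid[1] - z_grid[0]
    norm = sum(density) * dz
    if norm == 0.0:
        raise ValueError("no catalog support near the requested color")
    return [d / norm for d in density]


# Synthetic stand-in for a matched truth/measurement catalog (made-up numbers).
rng = random.Random(1)
catalog = []
for _ in range(2000):
    z = rng.uniform(0.0, 2.0)
    catalog.append((z, rng.gauss(0.8 * z, 0.1)))

z_grid = [i * 0.02 for i in range(101)]  # uniform grid from 0 to 2
pz = conditional_pz(catalog, color_query=0.8, z_grid=z_grid, h_color=0.05, h_z=0.05)
z_mean = sum(z * p for z, p in zip(z_grid, pz)) * (z_grid[1] - z_grid[0])
```

The `ValueError` branch is where the density concern raised earlier in the thread would show up: a query with no nearby catalog galaxies has no empirical posterior to read off.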

cwwalter commented 6 years ago

Also, alternatively, if you wanted to do this without any errors associated with measurement, the atmosphere etc, you could skip the image simulation and do this at the catalog level using CatSim. For every object you could integrate the SED through the bandpass and calculate a flux in each band. That is basically how we make the input to the image simulations now (there are subtleties as to what is actually passed to each simulation but this can be done as I say).
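
A generic sketch of integrating an SED through a bandpass is shown below. It does not use or imitate CatSim's API. It assumes the SED and the transmission curve are sampled on the same wavelength grid, uses made-up top-hat bands, and adopts a plain transmission-weighted mean flux density; a photon-counting convention would add a factor of wavelength to both integrals.

```python
def trapezoid(y, x):
    """Trapezoidal integral of samples y over (possibly uneven) x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))


def band_flux(wavelength, flux_density, transmission):
    """Transmission-weighted mean flux density in one band."""
    numerator = trapezoid([f * t for f, t in zip(flux_density, transmission)], wavelength)
    denominator = trapezoid(transmission, wavelength)
    if denominator <= 0.0:
        raise ValueError("bandpass has no throughput on this wavelength grid")
    return numerator / denominator


# Hypothetical wavelength grid in nm and made-up top-hat bandpasses.
wavelength = [300.0 + i for i in range(801)]  # 300 nm to 1100 nm


def top_hat(lo, hi):
    """Transmission of 1 inside [lo, hi] and 0 outside."""
    return [1.0 if lo <= w <= hi else 0.0 for w in wavelength]


bands = {"g": top_hat(400.0, 550.0), "r": top_hat(550.0, 700.0)}

# Toy observed-frame SED that rises linearly with wavelength (arbitrary units).
sed = [w / 500.0 for w in wavelength]
fluxes = {name: band_flux(wavelength, sed, t) for name, t in bands.items()}
```

Because nothing stochastic enters, the same SED always gives the same band fluxes, which matches the "without any errors associated with measurement" case described above.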

aimalz commented 6 years ago

So now you can populate a distribution in photometry (in as many bands as you want) vs true redshift using each measured galaxy as a point. I think you could use something like a GP to interpolate over that space. I think everything else has been marginalized over already. If you pick a set of photometries you can read off the redshift PDF. Is this what you want? Or do you need something more? Or am I missing something basic?

That sounds perfect! @salmanhabib just explained the pipeline to me here at Sprint Week, and it sounds very doable. We might try making a toy script for this soon using the pre-lensing, pre-atmosphere, pre-telescope band flux you mentioned as a placeholder until the real thing is ready, but I think it makes sense to wait until there's a more solid plan for PZ DC2 work.

salmanhabib commented 6 years ago

@aimalz @cwwalter I am talking to our stats collaborators about the best way to do this (we have a lot of experience in this space) and will get back when we have a nice solution -- the general problem is of interest for a number of other applications aside from photo-z estimation.

aimalz commented 6 years ago

I've been thinking about what tests we can run to validate the p(z)s obtained from the strategy proposed last week, since they won't be coming from a forward model. I think we'd have to draw redshifts from the PDFs in the catalog (which are p(z | [colors or magnitudes])) and check that the distribution of colors/magnitudes as a function of redshift (which is p([colors or magnitudes] | z)) and statistics thereof are consistent with those derived from the joint distribution of redshift and colors/magnitudes (which is p(z, [colors or magnitudes])) from which the catalog of PDFs was derived. I may or may not be making sense, but I want to be sure we don't accidentally use circular reasoning that could back out perfect p(z)s from information we won't have with the real LSST data, so I am at least leaving a note here to think about how to check that the true p(z)s are "right."
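
One way to picture this check is with a toy where the exact posterior is known in closed form. The sketch below is an assumption-laden illustration, not a proposed DC2 test: it uses a hypothetical Gaussian prior and a linear-Gaussian color model, redraws a redshift for each galaxy from its posterior, and compares color statistics as a function of redshift between the true and the redrawn redshifts.

```python
import math
import random
import statistics

# Hypothetical toy joint distribution (all numbers are made up).
PRIOR_MEAN, PRIOR_STD = 1.0, 0.2   # assumed p(z)
SLOPE, NOISE_STD = 0.8, 0.1        # assumed p(color | z) = N(SLOPE * z, NOISE_STD)


def posterior_params(color_obs):
    """Mean and std of the exact Gaussian posterior p(z | color_obs)."""
    precision = 1.0 / PRIOR_STD**2 + SLOPE**2 / NOISE_STD**2
    mean = (PRIOR_MEAN / PRIOR_STD**2 + SLOPE * color_obs / NOISE_STD**2) / precision
    return mean, math.sqrt(1.0 / precision)


def mean_color_in_bin(z_values, color_values, lo, hi):
    """Mean color of galaxies whose redshift falls in [lo, hi)."""
    return statistics.fmean(c for z, c in zip(z_values, color_values) if lo <= z < hi)


rng = random.Random(42)
z_true, z_redrawn, colors = [], [], []
for _ in range(20000):
    z = rng.gauss(PRIOR_MEAN, PRIOR_STD)
    color = rng.gauss(SLOPE * z, NOISE_STD)
    mean, std = posterior_params(color)
    z_true.append(z)
    colors.append(color)
    z_redrawn.append(rng.gauss(mean, std))  # draw from p(z | color)

# If the PDFs are the true posteriors, (z_redrawn, color) follows the same
# joint distribution as (z_true, color), so these statistics should agree.
std_ratio = statistics.pstdev(z_redrawn) / statistics.pstdev(z_true)
color_gap = (mean_color_in_bin(z_redrawn, colors, 0.9, 1.1)
             - mean_color_in_bin(z_true, colors, 0.9, 1.1))
```

Agreement in such statistics is only a necessary condition; PDFs that are too broad or too narrow would shift `std_ratio` away from 1, but passing the check does not by itself rule out the circularity concern raised above.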

katrinheitmann commented 5 years ago

@aimalz @cwwalter @salmanhabib This discussion stopped in December 2017. I am not sure if this was followed up any further. Should we close this, work on it, move it? Thanks!

aimalz commented 5 years ago

@katrinheitmann Thanks for bringing this up again. I think some folks checked whether DC2's space of redshift and colors/magnitudes is sufficiently dense to interpolate at the Winter 2019 Hack Day. If it is, then for the PZ DC2 experiments, we can ensure that true redshifts and data are drawn from the interpolated space where true PDFs can be evaluated, and then we'll have what we need to do the comparisons. Apologies if I'm misremembering who worked on it, but @jbkalmbach @sschmidt23, if it was either of you, can you confirm if you discovered the data was sufficiently dense for an interpolation, with or without data augmentation via resampling from the observational errors?

jbkalmbach commented 5 years ago

I was working on a Generative Adversarial Network (GAN) to generate new data that approximated the catalog. I got a decent prototype running [image], but haven't done anything more at the moment. But if we have some next steps I can use this and help out.

katrinheitmann commented 5 years ago

@aimalz Any comments on @jbkalmbach's results? Thanks!

katrinheitmann commented 4 years ago

Another year has passed on this issue. Any updates? Did this move somewhere else? @aimalz Thanks!

aimalz commented 4 years ago

@katrinheitmann Sorry for dropping the ball on this! Active development is happening in the RAIL repository, so I think this is no longer your responsibility.