LSSTDESC / firecrown

DESC Cosmology Likelihood Framework
BSD 3-Clause "New" or "Revised" License

Add support for posterior predictive distributions #323

Open tilmantroester opened 9 months ago

tilmantroester commented 9 months ago

This requires functionality to draw samples of data vectors from the likelihood and to pass them back to the sampling framework.

The ability to draw samples from the likelihood is useful in other contexts as well, such as generating mock data vectors.

vitenti commented 9 months ago

This is already supported by NumCosmo's connector. Moreover, in Augur you can find code to do that, see for example srd_y1_3x2_like.py, where they generate a data vector from a theory vector and build a likelihood to be used by any framework.

marcpaterno commented 9 months ago

@tilmantroester is there something that Augur does, or some part of what it does, that you think should be moved from Augur to Firecrown?

tilmantroester commented 9 months ago

There are two reasons why I think this should be in firecrown. The first is that it's easiest to create the PPD draws while sampling instead of trying to create them after the fact. For a Gaussian likelihood with fixed covariance, doing it in a post-processing step is relatively straightforward if the model predictions get saved during sampling, but for other likelihoods this might require re-evaluating the likelihood at a large number of points, which we want to avoid. Drawing posterior predictive samples conditioned on parts of the data vector is probably easier to do in firecrown as well, since the description of how the data vector is structured is readily available there. The second reason is that I might want to use firecrown to generate mock data without the augur dependency, especially when building experimental pipelines.
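For the Gaussian, fixed-covariance case mentioned above, the post-processing step could look roughly like the sketch below. It assumes the theory vectors were saved during sampling as rows of an array. The function name draw_ppd_samples and its arguments are hypothetical and are not part of firecrown or augur.

```python
import numpy as np


def draw_ppd_samples(theory_vectors, covariance, rng=None):
    """Draw one mock data vector per saved theory vector (hypothetical helper).

    theory_vectors: array of shape (n_samples, n_data), one row per chain sample.
    covariance: fixed data covariance of shape (n_data, n_data).
    rng: None, an integer seed, or a numpy Generator.
    """
    rng = np.random.default_rng(rng)
    theory_vectors = np.atleast_2d(np.asarray(theory_vectors, dtype=float))
    # Factor the covariance once; the same factor is reused for every draw.
    chol = np.linalg.cholesky(np.asarray(covariance, dtype=float))
    # White noise, one row per chain sample.
    white = rng.standard_normal(theory_vectors.shape)
    # Colour the noise so each row has covariance chol @ chol.T, then shift
    # it by the corresponding model prediction.
    return theory_vectors + white @ chol.T
```

Because the covariance is fixed, no likelihood re-evaluation is needed. For likelihoods whose noise model depends on the parameters, this shortcut does not apply.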

joezuntz commented 9 months ago

The ability to return data vectors is also useful for general debugging, and I'd recommend saving the information to do this.

However, in cosmosis I did find that the one case where this was slow compared to likelihood evaluations was Supernovae, so perhaps make it optional?

vitenti commented 9 months ago

The CosmoSIS connector presently includes a section in the DataBlock labeled data_vector. This section contains three elements: firecrown_theory (the theory vector), firecrown_data (the data vector), and firecrown_inverse_covariance (the inverse covariance). To have these components written in the output chains, you can add them to the CosmoSIS .ini file under the extra_output section.

This behavior is automatically enabled for GaussFamily likelihoods, but the current implementation is not considered ideal. We are working to refine this process, with the goal of achieving the same outcome using DerivedParameter. The reason for the delay in implementing this change is the inherent difficulty of handling vector-derived parameters without resorting to the solution of appending _n to the derived parameter name to match the vector index.

Furthermore, as pointed out by @joezuntz, including a lengthy theory vector in the output chains can have a detrimental impact on processing speed. In NumCosmo, any data added to the output chains undergoes further processing, which includes computing statistics such as mean, variance, autocorrelation, and more. Additionally, including an extensive theory vector in the output would not only significantly slow down these processing tasks but also result in exceptionally large output files.

Thus, I think we should make this behavior optional and eventually move to a more general solution using DerivedParameter so all frameworks can use it equally. @tilmantroester, would you prefer a more complete solution where random draws are also performed from each theory + covariance?

tilmantroester commented 9 months ago

At this point I'm not too concerned about how this gets piped back to the sampling frameworks. For now I imagine just implementing a sample method in the likelihood class. Its output could then optionally be put into some data block of the sampling framework.
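As a rough illustration of what such a sample method might look like, here is a toy Gaussian likelihood. The class ToyGaussianLikelihood, its method names, and the straight-line model are all hypothetical stand-ins chosen for this sketch. They do not reflect firecrown's actual likelihood interface.

```python
import numpy as np


class ToyGaussianLikelihood:
    """Hypothetical stand-in, not firecrown's actual likelihood class."""

    def __init__(self, data_vector, covariance):
        self.data_vector = np.asarray(data_vector, dtype=float)
        self.covariance = np.asarray(covariance, dtype=float)
        # Lower-triangular factor L with L @ L.T == covariance.
        self._chol = np.linalg.cholesky(self.covariance)

    def compute_theory_vector(self, params):
        # Placeholder model: a straight line, amplitude + slope * index.
        x = np.arange(self.data_vector.size)
        return params["amplitude"] + params["slope"] * x

    def compute_loglike(self, params):
        # Gaussian log-likelihood up to an additive constant.
        residual = self.data_vector - self.compute_theory_vector(params)
        # Solve L y = r, so that y @ y equals r^T C^{-1} r.
        y = np.linalg.solve(self._chol, residual)
        return -0.5 * float(y @ y)

    def sample(self, params, rng=None):
        """Draw one mock data vector at the given parameters."""
        rng = np.random.default_rng(rng)
        theory = self.compute_theory_vector(params)
        return theory + self._chol @ rng.standard_normal(theory.size)
```

A connector could call sample right after the likelihood evaluation and store the result wherever the framework allows. Keeping that call optional avoids the extra cost when the draws are not needed.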

As you said, treating theory or mock data vectors as derived parameters and dumping them into the chain output is at best cumbersome and at worst breaks the IO. Dealing with derived data that isn't just a parameter is something that the sampling frameworks would have to implement, I think. @joezuntz, I don't know if there is such functionality in cosmosis yet?

joezuntz commented 9 months ago

Samplers in CosmoSIS (or scripts using it interactively) can fully access the data block containing all the products of a pipeline, including the data vectors, so yes, this is already there. It's used in the Fisher sampler, for example.

tilmantroester commented 9 months ago

Sorry, what I meant was: is there a way to efficiently save parts of the data block while sampling, independently of the default chain output? For example, saving the theory vector at each chain sample to a file, taking care of the usual IO pitfalls like multiple MPI processes, and without having an unwieldy extra_output option with an entry for each data point.

joezuntz commented 9 months ago

Oh, I see. You can specify a vector output for extra_output, if you know the length in advance, by doing, e.g. for a length 222 data vector, extra_output = data_vector/2pt_theory#222. I know that's a bit annoying. I don't have another approach built-in.
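Since the length has to be known in advance, a pipeline script could build the entry from the data vector itself. The helper below is a small hypothetical sketch that only formats the section/name#length string shown above. Whether the firecrown connector's data_vector entries can be written this way is an assumption.

```python
def vector_extra_output(section, name, length):
    """Format a vector-valued extra_output entry as section/name#length.

    Hypothetical helper; it only builds the string and does not talk to CosmoSIS.
    """
    if length <= 0:
        raise ValueError("length must be a positive integer")
    return f"{section}/{name}#{length}"


# Example: the theory vector entry for a length-222 data vector.
entry = vector_extra_output("data_vector", "firecrown_theory", 222)
print(f"extra_output = {entry}")
# prints: extra_output = data_vector/firecrown_theory#222
```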