New features for estimating observed and predicted distributions (other than KDE)

yentyu commented 3 years ago

@eecsu @smattis We would like to add different options for how the observed and predicted distributions are estimated (i.e., with Bayesian GMM) when calculating R in the data-consistent framework. In order to accomplish this in a fashion which allows for future development, I've come up with an outline of how these features can be added to the code in a smooth and consistent way. @eecsu and I just wanted to start a discussion to make sure the general approach seems reasonable and we aren't overlooking anything that might break things or cause big problems with the package.

Here's the general idea. Currently, the discretization object points to 3 different sample_set_base objects: an input (or initial), a output (or predicted), and an observed. We want to use these objects to save the estimated pdfs for observed and predicted to their respective sample_set_base object using the _prob_type and _prob_parameters and then call the base object's evaluate_pdf method when computing R. This has some nice benefits:

It makes it easier to expand the different options for estimating pdfs by adding new _prob_type to the base object
It cuts down on repeated computations for situations when you want to do a batch of inversions with the same predicted but different observed distributions (because the predicted distribution is only computed once)

Here is the specific outline of the changes this would involve:

Step 1: Update sample_set_base object

change Need to fix the evaluate_pdf function for KDEs so that it can evaluate multidimensional KDE pdfs (instead of relying on the evaluation of a set of marignals)
reason The scipy.stats.gaussian_kde objects will need to be saved as the _prob_parameters of the base object and the _prob_type as "kde". For the mixture modeling (Bayesian or otherwise), we just need to save the output parameters from the sklearn.mixture, i.e. the means, covariances, weights (in this order) as the _prob_parameters with _prob_type as "gmm".

Step 2: Create generate_densities function

new Create a function with discretization object as input and options for computing either the observed or predicted distribution (or both) as well as the TYPE of pdf. Options will include:
- "kde" - scipy stats gaussian kernel density estimate
- "gmm" - sklearn gaussian mixture model
- "bgm" - sklearn bayesian gaussian mixture model (including Dirichlet process)
Options should include the passing of arguments related to utilizing these different distribution estimates
Output of the function should return the requested kdes and clusters (see current implementation of generate_output_kdes
Function should also SAVE the parameters to the appropriate sample_set_base objects associated with the discretization object. For instance, if generating the predicted distribution, the function should use the following to save the pdf:
```
discretization.get_output_sample_set().set_prob_type(this_arg)
discretization.get_output_sample_set().set_prob_paramaters(these_params)
```

Step 3: Change invert_... Methods

change Most current inversion methods call generate_output_kdes and then call invert. But invert ALSO calls generate_output_kdes. This seems like unnecessary computation. We should remove the call to generate_output_kdes except in the invert function.
new Change the invert method to call the new generate_densities function defined above to generate the output distributions.
- There should be a check to determine whether to do this step or not. If we are running a batch of inversions using the same observed or predicted, we should not try to regenerate both of these distributions each time. Instead, we should use the pdf that was saved to the sample_set_base object of the discretization.
The invert method should use the saved sample_set_base observed and predicted pdfs associated with the discretization to compute the r-values.

smattis commented 3 years ago

Hey @yentyu. Thanks for proposing this. I totally agree that this change that you are proposing is the best way to do things. @eecsu and I had discussions about this back in the summer. I hardcoded the method of using KDEs on the full space to calculate the ratios, because that was the main way that we were doing it in our work. I agree that it should instead be done using evaluate_pdf and should let you use any type of probability format (KDE, scipy.stats.random_variable, GMM, etc). This is something that I have been wanting to do, but have not had the time due to my job and baby, so I am very glad that it is something that you want to take on. I will try to help as best I can. I will give some thoughts to your proposed steps below.

smattis commented 3 years ago

Step 1: I don't think we want to fully remove the ability to have the probability described as a product of marginals, where each marginal is a kde. This especially may be useful in problems with high-dimensional parameter spaces. Perhaps that this could be renamed kde_marginals.

I totally agree that we should have a proper kde representation to go along with that.

smattis commented 3 years ago

Step 2: Yes, this should be done. I think I am mostly in agreement with the way you want to implement this. We also want to keep the ability to do a GMM, where the mixture is based on Gaussians coming from the different clusters (which is how gmm is being used now.)

smattis commented 3 years ago

Step 3:

Most current inversion methods call generate_output_kdes and then call invert. But invert ALSO calls generate_output_kdes. >This seems like unnecessary computation. We should remove the call to generate_output_kdes except in the invert function.

Seeing your comment, this is actually a bug that I did not catch as I was modifying the way the invert_* methods worked. You could write a PR to fix that now. I agree that once generate_densities is created, that all of the invert methods should be changed as you say.

UT-CHG / BET

New features for estimating observed and predicted distributions (other than KDE) #400