ICB-DCM / pyPESTO

python Parameter EStimation TOolbox
https://pypesto.readthedocs.io
BSD 3-Clause "New" or "Revised" License
216 stars 47 forks source link

Calculation of credibility intervalls currently not via joint posterior #1117

Open shoepfl opened 1 year ago

shoepfl commented 1 year ago

Hi there,

i recently figured out that the credibility intervals in pyPESTO are not calculated based on the x-percent HDI of the joint posterior but via percentiles for each parameter separately. This is for example in the function calculate_ci_mcmc_sample the case.

Wouldn´t it be better to calculate a posterior threshold based on an x-percent HDI on the joint posterior and afterward restrict the individual parameter CIs to the overall posterior threshold value? I think the other method could be problematic as individual calculation on each parameter does not necessarly have to match with the calculation on the joint posterior.

Best,

Sebastian

JanHasenauer commented 1 year ago

Hey Sebastian,

yes, using the HDI computed from the full posterior would definitely be preferable. We can add this to the todo list.

If you happen to have a good code lying around, we would be happy if you could make a suggestion.

Best, Jan

shoepfl commented 1 year ago

Hi Jan,

my idea was to take the Highest Posterior Density (HPD), which could be calculated by sorting the posterior values and throwing away the lowest x-percent of these values.

A short example would be: import h5py import numpy as np

file = h5py.File(example_pypesto_mcmc_chain.hdf5'), 'r') neglogpost = np.sort(self.file['sampling']['results']['trace_neglogpost'][:][0]) n_samples = self.file['sampling']['results']['trace_neglogpost'].shape[1] # total number of posterior samples # lower 1-ci percentage samples of the posterior are removed to get the according HPD self.lower_neglogpost = self.neglogpost[int(n_samples*(1-ci)):]

of course, this comes with the effect that the interval might be split in multimodal cases.

shoepfl commented 2 months ago

Adding to this, here would be a function that calculates HPD parameters based on the Pypesto sampling result:

def calculate_hpd(pypesto_result, ci_level: float, stepsize: int = 1) -> pd.DataFrame: """Calculate Highest Posterior Density (HPD) samples of pypesto sampling result.

Input: pypesto_result: The pyPESTO result object with filled sample result. ci_level: credibility level of HPD stepsize: Only one instepsizevalues is plotted. """ # get names of chain parameters param_names = pypesto_result.problem.get_reduced_vector(pypesto_result.problem.x_names)

# first calculate burn in burn_in = geweke_test(pypesto_result)

# Get converged parameter samples as numpy arrays chain = np.asarray(pypesto_result.sample_result.trace_x[0, burn_in:, :]) neglogpost = pypesto_result.sample_result.trace_neglogpost[0, burn_in:] indices = np.arange(burn_in, len(pypesto_result.sample_result.trace_neglogpost[0, :]))

# create df first, as we need to match neglogpost to the according parameter values pd_params = pd.DataFrame(chain, columns=param_names) pd_fval = pd.DataFrame(neglogpost, columns=['neglogPosterior']) pd_iter = pd.DataFrame(indices, columns=['iteration'])

params_df = pd.concat( [pd_params, pd_fval, pd_iter], axis=1, ignore_index=False )

# get lower neglogpost bound for HPD # sort neglogpost values of MCMC chain without burn in neglogpost_sort = np.sort(neglogpost)

# Get converged chain length chain_length = len(neglogpost)

# most negative ci percentage samples of the posterior are kept to get the according HPD neglogpost_lower_bound = neglogpost_sort[int(chain_length*(ci_level))]

# cut posterior to hpd hpd_params_df = params_df[params_df['neglogPosterior'] <= neglogpost_lower_bound]

# thin out by stepsize, from the index burn_in until end of vector hpd_params_thinned = hpd_params_df.iloc[::stepsize, :]

return hpd_params_thinned

This way, we get a Pandas dataframe with the HPD samples to a certain credibility level. It can be thinned by the "stepsize" parameter. For the ensemble prediction, all parameters or a subset of the HPD parameters can be simulated, and no marginal correction has to be made at the simulation level.

There is also a version where I cut the posterior to the HPD using numpy instead of pandas, which might be a speedup. However, the pandas version is quite handy as it has the parameter names connected to the parameter values.

Let me know what you think about it :)