Open shoepfl opened 1 year ago
Hey Sebastian,
yes, using the HDI computed from the full posterior would definitely be preferable. We can add this to the todo list.
If you happen to have a good code lying around, we would be happy if you could make a suggestion.
Best, Jan
Hi Jan,
my idea was to take the Highest Posterior Density (HPD), which could be calculated by sorting the posterior values and throwing away the lowest x-percent of these values.
A short example would be:
import h5py
import numpy as np
file = h5py.File(example_pypesto_mcmc_chain.hdf5'), 'r')
neglogpost = np.sort(self.file['sampling']['results']['trace_neglogpost'][:][0])
n_samples = self.file['sampling']['results']['trace_neglogpost'].shape[1] # total number of posterior samples
# lower 1-ci percentage samples of the posterior are removed to get the according HPD
self.lower_neglogpost = self.neglogpost[int(n_samples*(1-ci)):]
of course, this comes with the effect that the interval might be split in multimodal cases.
Adding to this, here would be a function that calculates HPD parameters based on the Pypesto sampling result:
def calculate_hpd(pypesto_result, ci_level: float, stepsize: int = 1) -> pd.DataFrame:
"""Calculate Highest Posterior Density (HPD) samples of pypesto sampling result.
Input:
pypesto_result: The pyPESTO result object with filled sample result.
ci_level: credibility level of HPD
stepsize: Only one in
stepsizevalues is plotted.
"""
# get names of chain parameters
param_names = pypesto_result.problem.get_reduced_vector(pypesto_result.problem.x_names)
# first calculate burn in
burn_in = geweke_test(pypesto_result)
# Get converged parameter samples as numpy arrays
chain = np.asarray(pypesto_result.sample_result.trace_x[0, burn_in:, :])
neglogpost = pypesto_result.sample_result.trace_neglogpost[0, burn_in:]
indices = np.arange(burn_in, len(pypesto_result.sample_result.trace_neglogpost[0, :]))
# create df first, as we need to match neglogpost to the according parameter values
pd_params = pd.DataFrame(chain, columns=param_names)
pd_fval = pd.DataFrame(neglogpost, columns=['neglogPosterior'])
pd_iter = pd.DataFrame(indices, columns=['iteration'])
params_df = pd.concat(
[pd_params, pd_fval, pd_iter], axis=1, ignore_index=False
)
# get lower neglogpost bound for HPD
# sort neglogpost values of MCMC chain without burn in
neglogpost_sort = np.sort(neglogpost)
# Get converged chain length
chain_length = len(neglogpost)
# most negative ci percentage samples of the posterior are kept to get the according HPD
neglogpost_lower_bound = neglogpost_sort[int(chain_length*(ci_level))]
# cut posterior to hpd
hpd_params_df = params_df[params_df['neglogPosterior'] <= neglogpost_lower_bound]
# thin out by stepsize, from the index burn_in until end of vector
hpd_params_thinned = hpd_params_df.iloc[::stepsize, :]
return hpd_params_thinned
This way, we get a Pandas dataframe with the HPD samples to a certain credibility level. It can be thinned by the "stepsize" parameter. For the ensemble prediction, all parameters or a subset of the HPD parameters can be simulated, and no marginal correction has to be made at the simulation level.
There is also a version where I cut the posterior to the HPD using numpy instead of pandas, which might be a speedup. However, the pandas version is quite handy as it has the parameter names connected to the parameter values.
Let me know what you think about it :)
Hi there,
i recently figured out that the credibility intervals in pyPESTO are not calculated based on the x-percent HDI of the joint posterior but via percentiles for each parameter separately. This is for example in the function
calculate_ci_mcmc_sample
the case.Wouldn´t it be better to calculate a posterior threshold based on an x-percent HDI on the joint posterior and afterward restrict the individual parameter CIs to the overall posterior threshold value? I think the other method could be problematic as individual calculation on each parameter does not necessarly have to match with the calculation on the joint posterior.
Best,
Sebastian