Open jysullivan opened 5 years ago
Sample code to implement posterior predictive intervals for egg deposition (not putting this in the code until it's been reviewed):
df <- as.data.frame(D[["data_egg_dep"]])
colnames(df) <- c("Year", "obs_egg", "log_se")
df %>%
filter(Year >= D[["mod_syr"]]) %>%
mutate(Year = factor(Year)) %>%
select(Year, log_se) %>%
left_join(egg_sum) -> egg_sum
egg_sum %>%
mutate(log_value = log(value),
# Posterior predictive value
pp_value = exp(rnorm(n(), mean = log_value, sd = log_se))) -> egg_sum
egg_sum %>%
group_by(Year) %>%
mutate(pp_mean = mean(pp_value),
p025 = quantile(pp_value, 0.025),
p975 = quantile(pp_value, 0.975)) -> egg_sum
df <- as.data.frame(D[["data_egg_dep"]][ , 1:2])
colnames(df) <- c("Year", "obs")
egg_sum %>% distinct(Year, mean, q025, q975, p025, p975) %>%
ungroup() %>%
mutate(Year = as.numeric(as.character(Year))) -> egg_sum2
df %>%
filter(Year >= D[["mod_syr"]]) -> df
axisx <- tickr(df, Year, 5)
df %>%
# 2008 survey egg estimates were extreme high and variable, pmin() helps
# reduce the scale of the upper confidence interval for better visualization:
mutate(egg_upper = pmin(obs + LS_byyear$egg_upper, max(obs) * 1.2),
egg_lower = obs - LS_byyear$egg_lower,
pred = D[["pred_egg_dep"]]) %>%
ggplot(aes(x = Year)) +
# Credibility intervals.
geom_point(aes(y = obs, shape = "Historical estimates from survey")) +
geom_ribbon(data = egg_sum2, aes(x = Year, ymin = p025, ymax = p975),
alpha = 0.6, fill = "grey70") +
geom_ribbon(data = egg_sum2, aes(x = Year, ymin = q025, ymax = q975),
alpha = 0.6, fill = "grey40") +
geom_line(data = egg_sum2, aes(x = Year, y = mean)) +
# Egg dep confidence intervals (To add whiskers, remove width=0)
geom_errorbar(aes(ymin = egg_lower, ymax = egg_upper), colour = "black", width = 0, size = 0.001) + scale_colour_manual(values = "grey") +
theme(legend.position = c(0.25, 0.8),
legend.spacing.y = unit(0, "cm")) +
scale_x_continuous(breaks = axisx$breaks, labels = axisx$labels) +
scale_y_continuous(limits = c(0, max(df$obs) * 1.2)) +
labs(x = NULL, y = "Eggs spawned (trillions)\n", shape = NULL, linetype = NULL)
Caption: Light grey is 95% posterior predictive interval, dark grey is 95% credible interval.
The previous commit was producing sensible posterior predictive intervals for the egg deposition or the compositions:
// Estimate posterior predictive intervals for abundance indices and composition data
int pp_seed = 123;
random_number_generator rng(pp_seed);
// Posterior predictive interval for predicted catch
dvector pp_catch(mod_syr,mod_nyr); // empty vector to hold prediction from each posterior sample
pp_catch.initialize();
double mu_catch;
double sigma_catch;
// Sample from lognormal distribution
for(int i = mod_syr; i <= mod_nyr; i++) {
mu_catch = value(log(pred_catch(i))); // assume prediction = mean, assume lognormal distribution
sigma_catch = TINY + column(data_catch,3)(i); // log_se
pp_catch(i) = mfexp(rnorm(mu_catch,sigma_catch,rng)); // put prediction back on natural scale
}
// Posterior predictive intervals for egg deposition with lognormal distribution:
dvector pp_egg_dep(mod_syr,mod_nyr); // empty vector to hold prediction from each posterior sample
pp_egg_dep.initialize();
double mu_egg_dep;
double sigma_egg_dep;
// Sample from lognormal distribution
for(int i = mod_syr; i <= mod_nyr; i++) {
mu_egg_dep = value(log(pred_egg_dep(i))); // assume prediction = mean, assume lognormal distribution
sigma_egg_dep = TINY + column(data_egg_dep,3)(i); // log_se
pp_egg_dep(i) = mfexp(rnorm(mu_egg_dep,sigma_egg_dep,rng)); // put prediction back on natural scale
}
// Posterior predictive intervals for age comps with multivariate logistic distribution:
dmatrix pp_sp_comp(mod_syr,mod_nyr,sage,nage); // empty matrix to hold prediction from each posterior sample
dmatrix pp_cm_comp(mod_syr,mod_nyr,sage,nage);
pp_sp_comp.initialize();
pp_cm_comp.initialize();
double pp_sp_tau2; // variance estimates for each posterior sample (single variance across all years)
double pp_cm_tau2;
double minp = 0.00; // threshold proportion <= grouped with next bin
// spawning survey age composition
dmatrix d_sp_comp = trans(trans(data_sp_comp).sub(sage,nage)).sub(mod_syr,mod_nyr);
pp_sp_tau2 = value(dmvlogistic(d_sp_comp,pred_sp_comp,resd_sp_comp,pp_sp_tau2,minp,false)); // value() returns double
// commercial fishery age composition
dmatrix d_cm_comp = trans(trans(data_cm_comp).sub(sage,nage)).sub(mod_syr,mod_nyr);
pp_cm_tau2 = value(dmvlogistic(d_cm_comp,pred_cm_comp,resd_cm_comp,pp_cm_tau2,minp,false)); // false = extract variance instead of nll
// sample from multivariate logistic distribution
for(int i = mod_syr; i <= mod_nyr; i++) {
dvector pp_sp = value(pred_sp_comp(i));
dvector pp_cm = value(pred_cm_comp(i));
pp_sp_comp(i)(sage,nage) = rmvlogistic(pp_sp,pp_sp_tau2,pp_seed-i);
pp_cm_comp(i)(sage,nage) = rmvlogistic(pp_cm,pp_cm_tau2,pp_seed-i);
}
It's easy enough to produce the intervals in R for the egg deposition using rnorm()
, so I'm going to try writing the rmvlogistic()
in R instead.
Definitions from Wikipedia and Cross Validated A credible interval [a,b] is a subset of the parameter space such that P(a≤Θ≤b∣X1=x1,…,Xn=xn)=α, and it means that, after seeing the data, you believe that with probability α the parameter value is inside this interval.
A posterior predictive interval [u,v] is a subset of the sampling space such that P(u≤Xn+1≤v∣X1=x1,…,Xn=xn)=γ, and it means that, after seeing the data, you believe that with probability γ the value of a future observation Xn+1 will be inside this interval. In other words, a posterior predictive interval is the distribution of possible unobserved values conditional on the observed values.
An alternative explanation from Trevor Branch 2019-06-11:
Posterior interval = credible interval = the interval around estimates of model parameters (the more data, the narrower the intervals, a bit like a standard error). Posterior predictive intervals give an estimate of the real variability, taking into account the estimate of variance (a bit like standard deviation, which describes population variability).
For example, imagine you are trying to estimate the height of people in Seattle, thus there are two parameters: mean height, and SD of height. We would get posterior intervals (or credible intervals) for mean height, and the more people we measured, the narrower the credible interval would be on the mean height parameter. But if someone asked what the distribution of heights of people in Seattle was, that would be the posterior predictive interval, which takes into account both estimates of mean height (X) and estimates of SD of height (Y). It could be obtained by taking the posterior samples, and for each pair of Xi, Yi, drawing a sample Li from a normal with mean Xi and SD = Yi. The distribution of Li values would be the posterior predictive and would be a realistic representation of what the distribution of heights in Seattle looks like.