Posterior predictive intervals

jysullivan commented 5 years ago

Definitions from Wikipedia and Cross Validated A credible interval [a,b] is a subset of the parameter space such that P(a≤Θ≤b∣X1=x1,…,Xn=xn)=α, and it means that, after seeing the data, you believe that with probability α the parameter value is inside this interval.

A posterior predictive interval [u,v] is a subset of the sampling space such that P(u≤Xn+1≤v∣X1=x1,…,Xn=xn)=γ, and it means that, after seeing the data, you believe that with probability γ the value of a future observation Xn+1 will be inside this interval. In other words, a posterior predictive interval is the distribution of possible unobserved values conditional on the observed values.

An alternative explanation from Trevor Branch 2019-06-11:

Posterior interval = credible interval = the interval around estimates of model parameters (the more data, the narrower the intervals, a bit like a standard error). Posterior predictive intervals give an estimate of the real variability, taking into account the estimate of variance (a bit like standard deviation, which describes population variability).

For example, imagine you are trying to estimate the height of people in Seattle, thus there are two parameters: mean height, and SD of height. We would get posterior intervals (or credible intervals) for mean height, and the more people we measured, the narrower the credible interval would be on the mean height parameter. But if someone asked what the distribution of heights of people in Seattle was, that would be the posterior predictive interval, which takes into account both estimates of mean height (X) and estimates of SD of height (Y). It could be obtained by taking the posterior samples, and for each pair of Xi, Yi, drawing a sample Li from a normal with mean Xi and SD = Yi. The distribution of Li values would be the posterior predictive and would be a realistic representation of what the distribution of heights in Seattle looks like.

jysullivan commented 5 years ago

Sample code to implement posterior predictive intervals for egg deposition (not putting this in the code until it's been reviewed):

df <- as.data.frame(D[["data_egg_dep"]])
colnames(df) <- c("Year", "obs_egg", "log_se")
df %>% 
  filter(Year >= D[["mod_syr"]]) %>% 
  mutate(Year = factor(Year)) %>% 
  select(Year, log_se) %>% 
  left_join(egg_sum) -> egg_sum

egg_sum %>%  
  mutate(log_value = log(value),
         # Posterior predictive value
         pp_value = exp(rnorm(n(), mean = log_value, sd = log_se))) -> egg_sum

egg_sum %>% 
  group_by(Year) %>% 
  mutate(pp_mean = mean(pp_value),
         p025 = quantile(pp_value, 0.025),
         p975 = quantile(pp_value, 0.975)) -> egg_sum

df <- as.data.frame(D[["data_egg_dep"]][ , 1:2])
colnames(df) <- c("Year", "obs")

egg_sum %>% distinct(Year, mean, q025, q975, p025, p975) %>% 
  ungroup() %>% 
  mutate(Year = as.numeric(as.character(Year))) -> egg_sum2

df %>% 
  filter(Year >= D[["mod_syr"]]) -> df
axisx <- tickr(df, Year, 5)
df %>% 
  # 2008 survey egg estimates were extreme high and variable, pmin() helps
  # reduce the scale of the upper confidence interval for better visualization:
  mutate(egg_upper = pmin(obs + LS_byyear$egg_upper, max(obs) * 1.2),
         egg_lower = obs - LS_byyear$egg_lower,
         pred = D[["pred_egg_dep"]]) %>% 
  ggplot(aes(x = Year)) +
  # Credibility intervals. 
  geom_point(aes(y = obs, shape = "Historical estimates from survey")) +
  geom_ribbon(data = egg_sum2, aes(x = Year, ymin = p025, ymax = p975),
              alpha = 0.6, fill = "grey70") +
  geom_ribbon(data = egg_sum2, aes(x = Year, ymin = q025, ymax = q975),
              alpha = 0.6, fill = "grey40") +
  geom_line(data = egg_sum2, aes(x = Year, y = mean)) + 
  # Egg dep confidence intervals (To add whiskers, remove width=0)
  geom_errorbar(aes(ymin = egg_lower, ymax = egg_upper), colour = "black", width = 0, size = 0.001) +  scale_colour_manual(values = "grey") +
  theme(legend.position = c(0.25, 0.8),
        legend.spacing.y = unit(0, "cm")) +
  scale_x_continuous(breaks = axisx$breaks, labels = axisx$labels) +
  scale_y_continuous(limits = c(0, max(df$obs) * 1.2)) +
  labs(x = NULL, y = "Eggs spawned (trillions)\n", shape = NULL, linetype = NULL)

jysullivan commented 5 years ago

Caption: Light grey is 95% posterior predictive interval, dark grey is 95% credible interval.

jysullivan commented 5 years ago

The previous commit was producing sensible posterior predictive intervals for the egg deposition or the compositions:

  // Estimate posterior predictive intervals for abundance indices and composition data
  int pp_seed = 123; 
  random_number_generator rng(pp_seed);

  // Posterior predictive interval for predicted catch
  dvector pp_catch(mod_syr,mod_nyr);   // empty vector to hold prediction from each posterior sample
  pp_catch.initialize();
  double mu_catch;
  double sigma_catch;

  // Sample from lognormal distribution 
  for(int i = mod_syr; i <= mod_nyr; i++) {
    mu_catch = value(log(pred_catch(i)));                     // assume prediction = mean, assume lognormal distribution
    sigma_catch = TINY + column(data_catch,3)(i);             // log_se
    pp_catch(i) = mfexp(rnorm(mu_catch,sigma_catch,rng));     // put prediction back on natural scale
  }

  // Posterior predictive intervals for egg deposition with lognormal distribution:
  dvector pp_egg_dep(mod_syr,mod_nyr);   // empty vector to hold prediction from each posterior sample
  pp_egg_dep.initialize();
  double mu_egg_dep;
  double sigma_egg_dep;

  // Sample from lognormal distribution 
  for(int i = mod_syr; i <= mod_nyr; i++) {
    mu_egg_dep = value(log(pred_egg_dep(i)));                      // assume prediction = mean, assume lognormal distribution
    sigma_egg_dep = TINY + column(data_egg_dep,3)(i);              // log_se
    pp_egg_dep(i) = mfexp(rnorm(mu_egg_dep,sigma_egg_dep,rng));    // put prediction back on natural scale
  }

  // Posterior predictive intervals for age comps with multivariate logistic distribution:
  dmatrix pp_sp_comp(mod_syr,mod_nyr,sage,nage);    // empty matrix to hold prediction from each posterior sample
  dmatrix pp_cm_comp(mod_syr,mod_nyr,sage,nage);    
  pp_sp_comp.initialize();
  pp_cm_comp.initialize();

  double pp_sp_tau2;  // variance estimates for each posterior sample (single variance across all years)
  double pp_cm_tau2;    

  double minp = 0.00; // threshold proportion <= grouped with next bin

  // spawning survey age composition
  dmatrix d_sp_comp = trans(trans(data_sp_comp).sub(sage,nage)).sub(mod_syr,mod_nyr);
  pp_sp_tau2 = value(dmvlogistic(d_sp_comp,pred_sp_comp,resd_sp_comp,pp_sp_tau2,minp,false));   // value() returns double

  // commercial fishery age composition
  dmatrix d_cm_comp = trans(trans(data_cm_comp).sub(sage,nage)).sub(mod_syr,mod_nyr);
  pp_cm_tau2 = value(dmvlogistic(d_cm_comp,pred_cm_comp,resd_cm_comp,pp_cm_tau2,minp,false));   // false = extract variance instead of nll

  // sample from multivariate logistic distribution
  for(int i = mod_syr; i <= mod_nyr; i++) {
    dvector pp_sp = value(pred_sp_comp(i));
    dvector pp_cm = value(pred_cm_comp(i));

    pp_sp_comp(i)(sage,nage) = rmvlogistic(pp_sp,pp_sp_tau2,pp_seed-i);
    pp_cm_comp(i)(sage,nage) = rmvlogistic(pp_cm,pp_cm_tau2,pp_seed-i);
  }

jysullivan commented 5 years ago

It's easy enough to produce the intervals in R for the egg deposition using rnorm(), so I'm going to try writing the rmvlogistic() in R instead.

commfish / AlaskaHerring

Posterior predictive intervals #10