Open andrewbaxter439 opened 2 years ago
Changed to 95%CIs in dc063e5
Other key question @dkopasker is how best to show the CIs: this uses ribbons, but I can also do a line-range for each point instead.
Conducting multiple runs is equivalent to bootstrapping. The 95% confidence interval is defined by the observations ranked at the 2.5th and 97.5th percentiles, which is why ranking was performed. This may give a narrow range, but I don't think it is what you have done above.
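A minimal sketch of the percentile method described above, in base R. The `run_means` vector here is an illustrative stand-in for one summary value per simulation run; the names and numbers are not from the project.

```r
# Stand-in for one summary value per run (illustrative only)
set.seed(42)
run_means <- rnorm(1000, mean = 11, sd = 0.5)

# Rank the run-level values and take those at the 2.5th and 97.5th
# percentiles to form the 95% interval
ci_95 <- quantile(run_means, probs = c(0.025, 0.975))
```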
A ribbon that includes the areas between points would be my preference for displaying confidence intervals. Others would object to this since we do not have observations between annual points. It's a matter of preference that is likely to be decided by a peer reviewer. Code to do both would be useful.
Please label the y-axis and state what the red line means. A note indicating how many runs were conducted would also be useful. I also prefer a white background with one black and one grey line for publications.
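For reference, a minimal ggplot2 sketch of that styling: white background, one black and one grey line, a labelled y-axis, and a caption noting the run count. The data frame and column names are illustrative, not from the project's output files.

```r
library(ggplot2)

# Illustrative data: annual means under baseline and reform scenarios
plot_data <- data.frame(
  year     = rep(2010:2015, 2),
  ghq      = c(11.2, 11.0, 11.1, 10.9, 10.8, 10.7,
               11.2, 10.8, 10.6, 10.4, 10.3, 10.1),
  scenario = rep(c("Baseline", "Reform"), each = 6)
)

p <- ggplot(plot_data, aes(year, ghq, colour = scenario)) +
  geom_line() +
  scale_colour_manual(values = c(Baseline = "black", Reform = "grey50")) +
  labs(y = "Mean GHQ score", colour = NULL,
       caption = "Intervals based on 1,000 simulation runs") +
  theme_classic()  # white background for publication
```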
Thanks @dkopasker - minor facepalm there for me then, I do see now what you mean! Will sort these points shortly
@andrewbaxter439 Any progress so far?
Apologies - I've not had a chance to look at this again since the last changes. The next plan would be to turn this into a function runnable for each variable. Is there a key deadline for this?
> not had a chance to look at this again since last changes
No worries.
> Is there a key deadline for this?
I'm not aware of any; however, we need to ask @dkopasker when it comes to deadlines.
Not a deadline as such. I expect we will re-run the 1,000-run simulations near the end of this month with the intention of writing up. It would be helpful if publication-quality graphs with confidence intervals can be readily produced from the large output files.
I'd say mid-September is our target date then. I expect to be knee-deep in the code, teaching, and on leave till the end of this year, so it'd be great not to delay the development of essential parts of the code. Any minor corrections/improvements could be introduced later as we go.
Created a function in R/graphing_functions.R which can now graph a baseline/comparison line graph with assignable y-axis labels.
```r
library(readr)
library(tidyverse)
source("R/graphing_functions.R")

out_data <-
  read_csv("C:/Programming/covid19_effect_estimates/data/new_data.csv",
           show_col_types = FALSE)

out_data |>
  graph_policy_comparisons(out_ghq_baseline, out_ghq_reform, y_lab = "GHQ score")
#> Loading required package: SPHSUgraphs
```
Thanks Andy. This looks useful. For comparison, could you please produce a graph without ribbons but with the uncertainty interval marked at each point only? As mentioned above, some may object to areas between points being covered by a confidence interval when such points were not estimated.
Eventually, we will need to produce these graphs using the median. It may be worth retaining code for both the mean and median to allow for the preferences of peer reviewers.
> It may be worth retaining code for both the mean and median
That's a given, I'll try to fix this.
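One way to keep both summaries available is to compute them side by side when aggregating across runs. A dplyr sketch, with illustrative data and column names (not the project's actual output structure):

```r
library(dplyr)

# Illustrative run-level data: one GHQ value per run per year
set.seed(1)
run_data <- expand.grid(run = 1:50, year = 2010:2012)
run_data$ghq <- rnorm(nrow(run_data), mean = 11, sd = 0.4)

# Summarise each year across runs, retaining mean, median,
# and quantile-based interval bounds
run_summary <- run_data |>
  group_by(year) |>
  summarise(
    mean_ghq   = mean(ghq),
    median_ghq = median(ghq),
    ci_low     = quantile(ghq, 0.025),
    ci_high    = quantile(ghq, 0.975),
    .groups    = "drop"
  )
```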
Added two options: using the mean instead of the median for points (CIs should still be quantiles, I imagine?) and using other geoms, such as errorbar, for the ranges:
```r
out_data |>
  graph_policy_comparisons(
    out_ghq_baseline,
    out_ghq_reform,
    y_lab = "GHQ score",
    agg_method = mean,
    ci_geom = "errorbar"
  )
```
As a quick initial run of creating graphs, here's the output of the code currently in R/outputting graphs.R. As a small initial point: these outputs currently have a very narrow range (I accidentally used just the standard error of the means across 50 runs in the file). Should intervals combine the SDs of the means of each run, rather than taking the variance between the mean outputs?
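To illustrate why the range came out so narrow: the standard error of the mean shrinks with the number of runs, whereas a percentile interval across run-level means reflects their full spread. A base-R sketch of the two (numbers illustrative; this doesn't settle which is appropriate here):

```r
# Stand-in for one mean per run, 50 runs (illustrative only)
set.seed(7)
run_means <- rnorm(50, mean = 11, sd = 0.5)

# Standard-error interval: narrows as the number of runs grows
se <- sd(run_means) / sqrt(length(run_means))
se_interval <- mean(run_means) + c(-1.96, 1.96) * se

# Percentile interval: reflects the spread of run-level means directly
pct_interval <- quantile(run_means, probs = c(0.025, 0.975))
```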
Created on 2022-07-15 by the reprex package (v2.0.1)