The interpretation of the diagnostic plot

nuferates commented 1 year ago

The first outcome of the ROBMED results is the diagnostic plot. We already explained how to make sense of it in our ORM piece, but I think we should not assume that readers will be interested enough to explore how it is used. So can we add a question mark icon next to the diagnostic plot title and when the users hover their mouse over they get to see a generic explanation of how it is interpreted? It could even be nicer if we can generate a specific interpretation for each graph underneath (can AI help?), but it might be well beyond the scope of this project.

aalfons commented 1 year ago

Indeed, it would be extremely useful if there is some help that can be displayed. This is not just relevant for the diagnostic plot, but it may also be useful for other things.

Focusing on the diagnostic plot in this issue, we have some plots from various presentations that show what the plot looks like when there are no deviations from normality, when we have a heavily tailed distribution, and when we have left- or right skewness:

diagnostics-normal.pdf diagnostics-heavy-tails.pdf diagnostics-left-skewed.pdf diagnostics-right-skewed.pdf

Those plots are generated with the following script:

# --------------------------------------
# Author: Andreas Alfons
#         Erasmus Universiteit Rotterdam
# --------------------------------------

# load packages
library("robmed")
library("scales")
library("sn")

## control parameters for simulation
seed <- 20210617  # seed for the random number generator
n <- 200          # number of observations
K <- 1000         # number of simulation runs
R <- 5000         # number of bootstrap samples
# coefficients in mediation model
a <- b <- c <- 0.4
# error scales in mediation model
sigma_m <- sqrt(1 - a^2)
sigma_y <- sqrt(1 - b^2 - c^2 - 2*a*b*c)

## parameter combinations and fancy labels for plots
alpha <- c(0, 0, -Inf, Inf)
nu <- c(Inf, 2, Inf, Inf)
labels <- c("normal", "heavy-tails", "left-skewed", "right-skewed")

## control parameters for methods
level <- 0.95                                        # confidence level
lmrob_control <- reg_control(max_iterations = 5000)  # MM-estimator

# function to compute mean of skew-t distribution
get_mean <- function(ksi = 0, omega = 1, alpha = 0, nu = Inf) {
  if (is.infinite(alpha)) delta <- sign(alpha)
  else delta <- alpha / sqrt(1 + alpha^2)
  if (is.infinite(nu)) b_nu <- sqrt(2 / pi)
  else b_nu <- sqrt(nu) * beta((nu-1)/2, 0.5) / (sqrt(pi) * gamma(0.5))
  # compute mean
  ksi + omega * delta * b_nu
}

## set seed of the random number generator for reproducibility
set.seed(seed)

## generate independent variable and error terms such that results are
## comparable across different values of the parameters
x <- rnorm(n)
# First uniformly distributed values are drawn, which are later transformed
# to the given error distribution by applying the inverse CDF.
u_m <- runif(n)
u_y <- runif(n)

## loop over parameter settings
for (i in 1:length(labels)) {

  # transform errors
  if (is.infinite(nu[i])) {
    # normal or skew-normal distribution
    if (is.finite(alpha[i]) && alpha[i] == 0) {
      # normal errors
      e_m <- qnorm(u_m)
      e_y <- qnorm(u_y)
    } else {
      # skew-normal errors
      e_m <- qsn(u_m, alpha = alpha[i])
      e_y <- qsn(u_y, alpha = alpha[i])
    }
  } else {
    # t or skew-t distribution
    if (is.finite(alpha[i]) && alpha[i] == 0) {
      # t-distributed errors
      e_m <- qt(u_m, df = nu[i])
      e_y <- qt(u_y, df = nu[i])
    } else {
      # skew-t distributed errors
      e_m <- qst(u_m, alpha = alpha[i], nu = nu[i])
      e_y <- qst(u_y, alpha = alpha[i], nu = nu[i])
    }
  }

  # compute mean of error terms such that it can be subtracted
  mu_m <- get_mean(omega = sigma_m, alpha = alpha[i], nu = nu[i])
  mu_y <- get_mean(omega = sigma_y, alpha = alpha[i], nu = nu[i])

  # transform error terms (scale and center)
  e_m_transformed <- sigma_m * e_m - mu_m
  e_y_transformed <- sigma_y * e_y - mu_y

  # generate hypothesized mediator and dependent variable
  m <- a * x + e_m_transformed
  y <- b * m + c * x + e_y_transformed
  simulated_data <- data.frame(x, y, m)

  # fit mediation model
  robust_fit <- fit_mediation(simulated_data, "x", "y", "m",
                              control = lmrob_control)

  # create plot
  p <- weight_plot(robust_fit) +
    scale_color_manual("", values = c("black", "#00BFC4"))

  # save plot to file
  pdf(file = sprintf("figures/diagnostics-%s.pdf", labels[i]),
      width = 6, height = 4.25)
  print(p)
  dev.off()

}

aalfons commented 1 year ago

I don't want to attempt anything regarding AI generated interpretations of a given plot. Users might put that into papers without checking, which causes ethical issues regarding authorship and may reflect negatively on our software. I'm in favor of humans creating their own interpretations.

It would be our job to help them by having a clear explanation. Another suggestion would be to add confidence bands to make it clear if deviations from the expected line are significant, see this issue for package robmed: https://github.com/aalfons/robmed/issues/30

aalfons commented 1 year ago

@AuroreAA, do you have suggestions on how to implement such help functionality?

AuroreAA commented 1 year ago

I can think about different solutions:

either just a tooltip can be displayed on hover from the text and also link to another pdf file which explains everything in detail.
we can also convert the plot to an interactive one and adjust the hover message regarding the differences between the values
we can directly add the information above or below the plot on how to interpret it.
we can also have a dropdown menu on the plot.

aalfons / robmedExtra

The interpretation of the diagnostic plot #29