AndriSignorell / DescTools

Tools for Descriptive Statistics and Exploratory Data Analysis
http://andrisignorell.github.io/DescTools/
82 stars 18 forks source link

Confidence Levels #121

Closed chris-swenson closed 6 months ago

chris-swenson commented 1 year ago

I attempted to conduct a Bonferroni-adjusted confidence level with correlations using CorCI; however, I discovered that the confidence level I passed in, using the Bonferroni adjustment (1 - alpha / (2 * k)) where k = number of correlations conducted, that the CorCI code does an additional /2, leading my results to be incorrect. The same is true with MeanCI and likely other CI methods.

I would say this is unexpected behavior. Without it explicitly stated in the documentation that this is occurring, it is quite confusing to attempt to use a Bonferroni adjustment. I would either explicitly state this in the documentation, provide an option to not adjust the conf.level, or do not adjust the conf.level at all.

It would seem to me the best option is to update the documentation to make this change explicit.

chris-swenson commented 1 year ago

I noticed this is the same case for the t.test function, which is in the stats package. Apparently it is just understood that this is what happens when a confidence level is specified, regardless of any alterations someone might want to do (e.g., Bonferroni adjustment).

GegznaV commented 1 year ago

@chris-swenson, could you, please, provide a minimal working code example (reprex) that illustrates the issue?

chris-swenson commented 1 year ago

@GegznaV Sure, here you go:

library(DescTools)

df <- subset(datasets::iris, select = -c(Species))
summary(df)

n = nrow(df)
# number of correlations to test and control for all at once
# 4 * 3 / 2 = 6
k = 6
df_cor <- cor(df)

# 6 tests, so 1 - alpha / (2 * k) where k is the number of interpretations
conf_level = 1 - 0.05 / (2 * k)

CorCI_results <- data.frame(rbind(
  CorCI(df_cor[1, 2], n, conf.level=conf_level),
  CorCI(df_cor[1, 3], n, conf.level=conf_level),
  CorCI(df_cor[2, 3], n, conf.level=conf_level),
  CorCI(df_cor[1, 4], n, conf.level=conf_level),
  CorCI(df_cor[2, 4], n, conf.level=conf_level),
  CorCI(df_cor[3, 4], n, conf.level=conf_level)
))

df_z <- atanh(df_cor)
diag(df_z) <- NA
z_crit <- qnorm(1 - 0.05/(2*k))
df_z_low <- df_z - z_crit / sqrt(n - 3)
df_z_upp <- df_z + z_crit / sqrt(n - 3)
df_low <- (exp(2 * df_z_low) - 1) / (exp(2 * df_z_low) + 1)
df_upp <- (exp(2 * df_z_upp) - 1) / (exp(2 * df_z_upp) + 1)

manual_results <- data.frame(t(rbind(
  df_cor[upper.tri(df_cor)],
  df_low[upper.tri(df_low)],
  df_upp[upper.tri(df_upp)]
)))
colnames(manual_results) <- c('cor', 'lwr.ci', 'upr.ci')

CorCI_results
manual_results

# Correct way to do Bonferroni with CorCI
# Since CorCI divides conf.level by 2
conf_level = 1 - 0.05 / k
CorCI_results <- data.frame(rbind(
  CorCI(df_cor[1, 2], n, conf.level=conf_level),
  CorCI(df_cor[1, 3], n, conf.level=conf_level),
  CorCI(df_cor[2, 3], n, conf.level=conf_level),
  CorCI(df_cor[1, 4], n, conf.level=conf_level),
  CorCI(df_cor[2, 4], n, conf.level=conf_level),
  CorCI(df_cor[3, 4], n, conf.level=conf_level)
))

CorCI_results
manual_results
chris-swenson commented 1 year ago

The same problem occurs with t.test in the stats package. They divide by 2 which is not a great practice since it is not specified in the documentation that this transformation occurs.

# Wrong
conf_level = 1 - 0.05 / (2 * k)
data.frame(rbind(
  stats::t.test(df$Sepal.Length, df$Sepal.Width, conf.level = conf_level)$conf.int[1:2],
  stats::t.test(df$Sepal.Length, df$Petal.Length, conf.level = conf_level)$conf.int[1:2],
  stats::t.test(df$Sepal.Length, df$Petal.Width, conf.level = conf_level)$conf.int[1:2],
  stats::t.test(df$Sepal.Width, df$Petal.Length, conf.level = conf_level)$conf.int[1:2],
  stats::t.test(df$Sepal.Width, df$Petal.Width, conf.level = conf_level)$conf.int[1:2],
  stats::t.test(df$Petal.Length, df$Petal.Width, conf.level = conf_level)$conf.int[1:2]
))

# Right
conf_level = 1 - 0.05 / (k)
data.frame(rbind(
  stats::t.test(df$Sepal.Length, df$Sepal.Width, conf.level = conf_level)$conf.int[1:2],
  stats::t.test(df$Sepal.Length, df$Petal.Length, conf.level = conf_level)$conf.int[1:2],
  stats::t.test(df$Sepal.Length, df$Petal.Width, conf.level = conf_level)$conf.int[1:2],
  stats::t.test(df$Sepal.Width, df$Petal.Length, conf.level = conf_level)$conf.int[1:2],
  stats::t.test(df$Sepal.Width, df$Petal.Width, conf.level = conf_level)$conf.int[1:2],
  stats::t.test(df$Petal.Length, df$Petal.Width, conf.level = conf_level)$conf.int[1:2]
))
AndriSignorell commented 6 months ago

Hmm, after quite a while I can't see here any misbehaviour neither of t.test (which anyway would have been surprising) nor of CorCI. Two-sided confidence intervals are calculated correctly in both functions. In my opinion, further documentation is not necessary as this is described in every statistics textbook.