ggcorrmat on steroids - GitHub Issues

I wanted to get your opinion on an approach I've been using. I wanted to compute something like ggcorrmat(), but my data frame has categorical and boolean columns as well as numerical ones.

I've been using the following code to compute a set of tests and get a general idea of which features have "some relation" to which other features:

library(tidyverse)
library(statsExpressions)

# the data that i'll be using for this example
data <- mtcars %>% mutate(
  brand = gsub(" .*", "", rownames(mtcars)),
  vs = as.logical(vs),
  am = as.logical(am)
)

# annotation of the features
feature_ann <- tibble(
  feature = colnames(data),
  type = case_when(
    map_lgl(data, is.numeric) ~ "numerical",
    map_lgl(data, function(x) length(unique(na.omit(x))) == 2) ~ "logical",
    TRUE ~ "categorical"
  )
)

# which statsExpressions functions to use for which comparisons
stat_funs <- tribble(
  ~ xtype, ~ ytype, ~ statfun,
  "numerical", "numerical", function(df, x, y) corr_test(df, !!x, !!y, type = "np"),
  "numerical", "categorical", function(df, x, y) oneway_anova(df, !!y, !!x, type = "np"),
  "numerical", "logical", function(df, x, y) two_sample_test(df, !!y, !!x, type = "np"),
  "categorical", "numerical", function(df, x, y) oneway_anova(df, !!x, !!y, type = "np"),
  "categorical", "categorical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np"),
  "categorical", "logical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np"),
  "logical", "numerical", function(df, x, y) two_sample_test(df, !!x, !!y, type = "np"),
  "logical", "categorical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np"),
  "logical", "logical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np")
)

# the start of a comparison
feature_crossing <- crossing(
  x = feature_ann$feature,
  y = feature_ann$feature
) %>%
  left_join(feature_ann %>% select(x = feature, xtype = type), by = "x") %>%
  left_join(feature_ann %>% select(y = feature, ytype = type), by = "y") %>%
  left_join(stat_funs, by = c("xtype", "ytype"))

# the test results
stats_df <-
  feature_crossing %>%
  filter(x != y) %>%
  pmap_df(function(x, y, xtype, ytype, statfun) {
    odn <- data %>% select(!!x, !!y) %>% na.omit()
    # skip the test if either feature has fewer than two distinct values among the complete cases
    if (length(unique(odn[[x]])) <= 1 || length(unique(odn[[y]])) <= 1) return(NULL)
    # run the test on the same complete-case rows that were checked above
    statfun(odn, x, y) %>%
      mutate(parameter1 = x, parameter2 = y, p1type = xtype, p2type = ytype) %>%
      select(parameter1:p2type, everything())
  }) %>%
  mutate(p.value = p.adjust(p.value, method = "holm"))

stats_df %>% select(parameter1:estimate) %>% head(10)

Head of `stats_df`:

| parameter1 | parameter2 | p1type | p2type | statistic | df | p.value | method | effectsize | estimate |
|---|---|---|---|---|---|---|---|---|---|
| am | brand | logical | categorical | 29.9271255 | 21 | 1.0000000 | Pearson's Chi-squared test | Cramer's V (adj.) | 0.5161364 |
| am | carb | logical | numerical | 132.5000000 | NA | 1.0000000 | Wilcoxon rank sum test | r (rank biserial) | 0.0728745 |
| am | cyl | logical | numerical | 194.0000000 | NA | 0.2027903 | Wilcoxon rank sum test | r (rank biserial) | 0.5708502 |
| am | disp | logical | numerical | 214.0000000 | NA | 0.0384542 | Wilcoxon rank sum test | r (rank biserial) | 0.7327935 |
| am | drat | logical | numerical | 24.0000000 | NA | 0.0111300 | Wilcoxon rank sum test | r (rank biserial) | -0.8056680 |
| am | gear | logical | numerical | 16.0000000 | NA | 0.0007906 | Wilcoxon rank sum test | r (rank biserial) | -0.8704453 |
| am | hp | logical | numerical | 176.0000000 | NA | 1.0000000 | Wilcoxon rank sum test | r (rank biserial) | 0.4251012 |
| am | mpg | logical | numerical | 42.0000000 | NA | 0.1085407 | Wilcoxon rank sum test | r (rank biserial) | -0.6599190 |
| am | qsec | logical | numerical | 153.0000000 | NA | 1.0000000 | Wilcoxon rank sum test | r (rank biserial) | 0.2388664 |
| am | vs | logical | logical | 0.9068826 | 1 | 1.0000000 | Pearson's Chi-squared test | Cramer's V (adj.) | 0.0000000 |

When I plot this `stats_df`, I get something like this:

[image: plot]
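
The plotting code itself is not included in the post. As a rough sketch of how such a figure could be drawn (not necessarily how the original plot was made), the adjusted p-values can be shown as a tile plot with ggplot2. The helper name `plot_pvalue_tiles`, its `alpha` argument, and the small `stats_small` data frame are hypothetical and only there so the example runs on its own; the rows are copied from the head of `stats_df` shown above.

```r
library(ggplot2)

# A few rows copied from the head of `stats_df` above, so the sketch is
# self-contained. In practice the full `stats_df` would be passed instead.
stats_small <- data.frame(
  parameter1 = "am",
  parameter2 = c("carb", "cyl", "disp", "drat", "gear"),
  p.value    = c(1.0000000, 0.2027903, 0.0384542, 0.0111300, 0.0007906)
)

# Hypothetical helper (not part of ggstatsplot or statsExpressions):
# one tile per feature pair, filled by -log10 of the adjusted p-value.
plot_pvalue_tiles <- function(df, alpha = 0.05) {
  df$significant <- df$p.value < alpha
  ggplot(df, aes(x = parameter1, y = parameter2)) +
    geom_tile(aes(fill = -log10(p.value)), colour = "white") +
    # star the pairs whose adjusted p-value is below `alpha`
    geom_point(data = df[df$significant, , drop = FALSE], shape = 8) +
    scale_fill_gradient(low = "white", high = "steelblue") +
    labs(x = NULL, y = NULL, fill = "-log10(adj. p)") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

p <- plot_pvalue_tiles(stats_small)
print(p)
```

The fill uses the p-value rather than `estimate` because the effect sizes come from different tests (rank biserial r, Cramer's V, and so on) and are not on a common scale; that choice is an assumption of this sketch.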

Do you think there is some merit to this approach? Do you see some obvious flaws?

Kind regards, Robrecht

IndrajeetPatil / ggstatsplot

ggcorrmat on steroids #750