Open rcannood opened 2 years ago
I wanted to get your opinion on an approach I've been using. I wanted to compute a ggcorrmat()
, but my data frame has not only numerical but also categorical and boolean columns.
I've been using the following code to compute a set of tests and get a general idea of which features have "some relation" to which other features:
library(tidyverse)
library(statsExpressions)
# the data that i'll be using for this example
data <- mtcars %>% mutate(
brand = gsub(" .*", "", rownames(mtcars)),
vs = as.logical(vs),
am = as.logical(am)
)
# annotation of the features
feature_ann <- tibble(
feature = colnames(data),
type = case_when(
map_lgl(data, is.numeric) ~ "numerical",
map_lgl(data, function(x) length(unique(na.omit(x))) == 2) ~ "logical",
TRUE ~ "categorical"
)
)
# which statsExpressions functions to use for which comparisons
stat_funs <- tribble(
~ xtype, ~ ytype, ~ statfun,
"numerical", "numerical", function(df, x, y) corr_test(df, !!x, !!y, type = "np"),
"numerical", "categorical", function(df, x, y) oneway_anova(df, !!y, !!x, type = "np"),
"numerical", "logical", function(df, x, y) two_sample_test(df, !!y, !!x, type = "np"),
"categorical", "numerical", function(df, x, y) oneway_anova(df, !!x, !!y, type = "np"),
"categorical", "categorical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np"),
"categorical", "logical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np"),
"logical", "numerical", function(df, x, y) two_sample_test(df, !!x, !!y, type = "np"),
"logical", "categorical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np"),
"logical", "logical", function(df, x, y) contingency_table(df, !!x, !!y, type = "np")
)
# the start of a comparison
feature_crossing <- crossing(
x = feature_ann$feature,
y = feature_ann$feature
) %>%
left_join(feature_ann %>% select(x = feature, xtype = type), by = "x") %>%
left_join(feature_ann %>% select(y = feature, ytype = type), by = "y") %>%
left_join(stat_funs, by = c("xtype", "ytype"))
# the test results
stats_df <-
feature_crossing %>%
filter(x != y) %>%
pmap_df(function(x, y, xtype, ytype, statfun) {
odn <- data %>% select(!!x, !!y) %>% na.omit()
# if a comparison does not have enough overlapping data points, don't run the test
if (length(unique(odn[[x]])) <= 1 || length(unique(odn[[y]])) <= 1) return(NULL)
statfun(data, x, y) %>%
mutate(parameter1 = x, parameter2 = y, p1type = xtype, p2type = ytype) %>%
select(parameter1:p2type, everything())
}) %>%
mutate(p.value = p.adjust(p.value, method = "holm"))
stats_df %>% select(parameter1:estimate) %>% head(10)
Head of stats_df : |
parameter1 | parameter2 | p1type | p2type | statistic | df | p.value | method | effectsize | estimate |
---|---|---|---|---|---|---|---|---|---|---|
am | brand | logical | categorical | 29.9271255 | 21 | 1.0000000 | Pearson's Chi-squared test | Cramer's V (adj.) | 0.5161364 | |
am | carb | logical | numerical | 132.5000000 | NA | 1.0000000 | Wilcoxon rank sum test | r (rank biserial) | 0.0728745 | |
am | cyl | logical | numerical | 194.0000000 | NA | 0.2027903 | Wilcoxon rank sum test | r (rank biserial) | 0.5708502 | |
am | disp | logical | numerical | 214.0000000 | NA | 0.0384542 | Wilcoxon rank sum test | r (rank biserial) | 0.7327935 | |
am | drat | logical | numerical | 24.0000000 | NA | 0.0111300 | Wilcoxon rank sum test | r (rank biserial) | -0.8056680 | |
am | gear | logical | numerical | 16.0000000 | NA | 0.0007906 | Wilcoxon rank sum test | r (rank biserial) | -0.8704453 | |
am | hp | logical | numerical | 176.0000000 | NA | 1.0000000 | Wilcoxon rank sum test | r (rank biserial) | 0.4251012 | |
am | mpg | logical | numerical | 42.0000000 | NA | 0.1085407 | Wilcoxon rank sum test | r (rank biserial) | -0.6599190 | |
am | qsec | logical | numerical | 153.0000000 | NA | 1.0000000 | Wilcoxon rank sum test | r (rank biserial) | 0.2388664 | |
am | vs | logical | logical | 0.9068826 | 1 | 1.0000000 | Pearson's Chi-squared test | Cramer's V (adj.) | 0.0000000 |
When I plot this stats_df, I get something like this:
Do you think there is some merit to this approach? Do you see some obvious flaws?
Kind regards, Robrecht
Hi @IndrajeetPatil !
Thank you for creating ggstatsplot and statsExpressions! :relaxed: I love having a consistent API for running different types of tests.