dcomtois / summarytools

R Package to Quickly and Neatly Summarize Data

Remove columns used for grouping from dfSummary output #132

Closed ScaonE closed 3 years ago

ScaonE commented 3 years ago

Dear all,

I find the columns used for grouping pretty uninformative in the dfSummary output, given that within each group they always contain a single value with 100% frequency.

I couldn't find an option to remove them, so I tried doing it myself:

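# Load the packages used below
library(dplyr)        # group_by(), filter(), %>%
library(purrr)        # map()
library(magrittr)     # %<>%
library(summarytools) # dfSummary(), view()

# Create grouped dfSummary result
iris_summary <-
  iris %>%
  group_by(Species) %>%
  dfSummary()

# Check data
iris_summary[[1]]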

Data Frame Summary
iris
Group: Species = setosa
Dimensions: 50 x 5
Duplicates: 0


No  Variable       Stats / Values          Freqs (% of Valid)   Graph                 Valid    Missing
--  -------------  ----------------------  -------------------  --------------------  -------  -------
1   Sepal.Length   Mean (sd) : 5 (0.4)     15 distinct values   [histogram]           50       0
    [numeric]      min < med < max:                                                   (100%)   (0%)
                   4.3 < 5 < 5.8
                   IQR (CV) : 0.4 (0.1)

2   Sepal.Width    Mean (sd) : 3.4 (0.4)   16 distinct values   [histogram]           50       0
    [numeric]      min < med < max:                                                   (100%)   (0%)
                   2.3 < 3.4 < 4.4
                   IQR (CV) : 0.5 (0.1)

3   Petal.Length   Mean (sd) : 1.5 (0.2)   1.00 :  1 ( 2.0%)                          50       0
    [numeric]      min < med < max:        1.10 :  1 ( 2.0%)                          (100%)   (0%)
                   1 < 1.5 < 1.9           1.20 :  2 ( 4.0%)
                   IQR (CV) : 0.2 (0.1)    1.30 :  7 (14.0%)    II
                                           1.40 : 13 (26.0%)    IIIII
                                           1.50 : 13 (26.0%)    IIIII
                                           1.60 :  7 (14.0%)    II
                                           1.70 :  4 ( 8.0%)    I
                                           1.90 :  2 ( 4.0%)

4   Petal.Width    Mean (sd) : 0.2 (0.1)   0.10 :  5 (10.0%)    II                    50       0
    [numeric]      min < med < max:        0.20 : 29 (58.0%)    IIIIIIIIIII           (100%)   (0%)
                   0.1 < 0.2 < 0.6         0.30 :  7 (14.0%)    II
                   IQR (CV) : 0.1 (0.4)    0.40 :  7 (14.0%)    II
                                           0.50 :  1 ( 2.0%)
                                           0.60 :  1 ( 2.0%)

5   Species        1. setosa               50 (100.0%)          IIIIIIIIIIIIIIIIIIII  50       0
    [factor]       2. versicolor            0 (  0.0%)                                (100%)   (0%)
                   3. virginica             0 (  0.0%)

# Check data
str(iris_summary)

List of 3
 $ :Classes ‘summarytools’ and 'data.frame': 5 obs. of 8 variables:
  ..$ No      : num [1:5] 1 2 3 4 5
  ..$ Variable: chr [1:5] "Sepal.Length\\n[numeric]" "Sepal.Width\\n[numeric]" "Petal.Length\\n[numeric]" "Petal.Width\\n[numeric]" ...

# Retrieve the grouping variable string used in dfSummary output
(grp_var_string <-
   iris_summary[[1]]$Variable %>%
   grep("species", ., ignore.case = TRUE, value = TRUE))

[1] "Species\\n[factor]"

# Remove grouping variable from dfSummary output
iris_summary %<>%
  map(~ .x %>%
        filter(!Variable %in% grp_var_string))
# Check data
iris_summary[[1]]

Data Frame Summary
iris
Group: Species = setosa
Dimensions: 50 x 5
Duplicates: 0


No  Variable       Stats / Values          Freqs (% of Valid)   Graph                 Valid    Missing
--  -------------  ----------------------  -------------------  --------------------  -------  -------
1   Sepal.Length   Mean (sd) : 5 (0.4)     15 distinct values   [histogram]           50       0
    [numeric]      min < med < max:                                                   (100%)   (0%)
                   4.3 < 5 < 5.8
                   IQR (CV) : 0.4 (0.1)

2   Sepal.Width    Mean (sd) : 3.4 (0.4)   16 distinct values   [histogram]           50       0
    [numeric]      min < med < max:                                                   (100%)   (0%)
                   2.3 < 3.4 < 4.4
                   IQR (CV) : 0.5 (0.1)

3   Petal.Length   Mean (sd) : 1.5 (0.2)   1.00 :  1 ( 2.0%)                          50       0
    [numeric]      min < med < max:        1.10 :  1 ( 2.0%)                          (100%)   (0%)
                   1 < 1.5 < 1.9           1.20 :  2 ( 4.0%)
                   IQR (CV) : 0.2 (0.1)    1.30 :  7 (14.0%)    II
                                           1.40 : 13 (26.0%)    IIIII
                                           1.50 : 13 (26.0%)    IIIII
                                           1.60 :  7 (14.0%)    II
                                           1.70 :  4 ( 8.0%)    I
                                           1.90 :  2 ( 4.0%)

4   Petal.Width    Mean (sd) : 0.2 (0.1)   0.10 :  5 (10.0%)    II                    50       0
    [numeric]      min < med < max:        0.20 : 29 (58.0%)    IIIIIIIIIII           (100%)   (0%)
                   0.1 < 0.2 < 0.6         0.30 :  7 (14.0%)    II
                   IQR (CV) : 0.1 (0.4)    0.40 :  7 (14.0%)    II
                                           0.50 :  1 ( 2.0%)
                                           0.60 :  1 ( 2.0%)

# Check data
str(iris_summary)

List of 3
 $ :Classes ‘summarytools’ and 'data.frame': 4 obs. of 8 variables:
  ..$ No      : num [1:4] 1 2 3 4
  ..$ Variable: chr [1:4] "Sepal.Length\\n[numeric]" "Sepal.Width\\n[numeric]" "Petal.Length\\n[numeric]" "Petal.Width\\n[numeric]"

So far so good: each element is still a summarytools object, now without the grouping variable.

# Write output to HTML
iris_summary %>%
  view(file = "iris_summary.html")

x must either be a summarytools object created with freq(), descr(), or a list of summarytools objects created using by()

But this fails. Any tips or suggestions?

dcomtois commented 3 years ago

Seems like map() (which package is it from?) is not retaining the class of the object... You may try setting it explicitly, as in this example:

library(dplyr)
library(summarytools)
iris_summary <-
  iris %>%
  group_by(Species) %>%
  dfSummary()

class(iris_summary) # stby
iris_summary <- lapply(iris_summary, function(x) x[-5,])

class(iris_summary) # list
class(iris_summary) <- "stby"

view(iris_summary)
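
The class loss is not specific to one package: any function that rebuilds a list element by element tends to return a plain list. A minimal base R sketch on a toy object shows the drop and the fix, and selects the row to remove by variable name rather than by a hard-coded index. The class name `toy_stby`, the helper `drop_var()`, and the data are made up for illustration; nothing here calls summarytools.

```r
# Toy stand-in for a grouped summary: a classed list of data frames.
# "toy_stby" is a hypothetical class name, not the real summarytools class.
toy <- structure(
  list(
    setosa     = data.frame(Variable = c("Sepal.Length", "Species"), Valid = c(50, 50)),
    versicolor = data.frame(Variable = c("Sepal.Length", "Species"), Valid = c(50, 50))
  ),
  class = "toy_stby"
)

# Hypothetical helper: drop the rows whose Variable entry contains `var`.
drop_var <- function(df, var) {
  df[!grepl(var, df$Variable, fixed = TRUE), , drop = FALSE]
}

# lapply() returns a plain list, so the custom class is lost here.
filtered <- lapply(toy, drop_var, var = "Species")
class_after_lapply <- class(filtered)

# Restore the class explicitly, as suggested above.
class(filtered) <- class(toy)
```

Matching on the variable name avoids relying on the grouping variable sitting in row 5, which would not hold for other data sets.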

dcomtois commented 3 years ago

Grouping variables are now excluded by default from dfSummaries. They can be retained using keep.grp.vars = TRUE, either in dfSummary() or in print/view, as the masking occurs in the printing phase.

It can be tested by installing the dev-current branch.

ScaonE commented 3 years ago

Thanks for your input

class(iris_summary) <- "stby"

This did the trick for me. I didn't spend much time trying the dev-current version, as I ran into an installation error (this is on my end).

If I may ask another question in this thread: is it possible to display Q1 and Q3 instead of IQR (CV) in the dfSummary output?

dcomtois commented 3 years ago

Glad it worked out.

For the dfSummary stats, the problem is that everyone has their preferences... I might at some point include an optional "additional row" of stats, which could include Q1/Q3. But feel free to fork the package and modify it to suit your needs if you feel like experimenting a bit :)

Curious to know, the map() function you used, is it part of the purrr package?

ScaonE commented 3 years ago

I know all these features are a lot of work, so no problem; the tool is already great as is! (I saw related "issues" here & here.)

Yes it was purrr::map(). I am no expert with purrr, so there might be a map() variant which would have allowed me to keep the stby class.
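
One base R idiom that keeps the attributes of the original object is to assign the mapped result back into it with `x[] <- lapply(x, f)`. The sketch below uses a toy classed list (the class name `toy_stby` and the `note` attribute are invented for illustration); whether the same idiom behaves well on a real stby object is an assumption that has not been checked here. purrr::modify() is documented as returning the same type as its input and may serve a similar purpose, but that is likewise untested against summarytools objects.

```r
# Toy classed list; "toy_stby" and "note" are hypothetical, for illustration only.
toy <- structure(
  list(a = 1:3, b = 4:6),
  class = "toy_stby",
  note  = "extra attribute"
)

# Assigning into toy[] replaces the elements but keeps the object's attributes.
toy[] <- lapply(toy, function(x) x[-1])

class(toy)          # still the custom class
attr(toy, "note")   # other attributes survive as well
```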

ScaonE commented 3 years ago

One last question, if I may: can you point me to a description or explanation of IQR (CV) "vs" Q1 & Q3? I am so used to being asked to report Q1 & Q3, but rarely the IQR, so a good explanation might help me convince people of the IQR's usefulness.

dcomtois commented 3 years ago

Well, I wouldn't say one set is clearly "superior" to the other... It is true that, knowing Q1 and Q3, we're only one quick operation away from the IQR (IQR = Q3 - Q1). On the other hand, having both the IQR and the CV in the summary seems like a good compromise given the space available: it gives you a robust measure of dispersion (IQR), plus a "standardized" measure of dispersion (CV; "relative" is the more precise term) that lets you compare variability across variables that are on totally different scales.
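
To make the two measures concrete, here is a small base R sketch on the setosa rows of the built-in iris data. It assumes the CV is computed as sd / mean, which is consistent with the values in the summary table earlier in the thread; the variable names are illustrative.

```r
# Setosa rows of the built-in iris data set
setosa <- iris[iris$Species == "setosa", 1:4]
x <- setosa$Sepal.Length

# Quartiles (default quantile type) and the IQR derived from them
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1               # one subtraction away from Q1 and Q3

# Coefficient of variation: dispersion relative to the mean (unitless)
cv <- sd(x) / mean(x)

# Because the CV is unitless, it can be compared across variables
cv_by_var <- sapply(setosa, function(v) sd(v) / mean(v))
round(cv_by_var, 2)
```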

Already knowing the min, median and max (from the summary table), the IQR provides a faster way to mentally picture the kurtosis, while having Q1 and Q3 would shift the focus towards the skewness. And this is arguably the downside of the IQR as opposed to Q1 & Q3: it doesn't give you any information about skewness. However, the histograms are there to help us with that.
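
As a rough illustration of the skewness point, the sketch below compares the distance from Q1 to the median with the distance from the median to Q3 for setosa petal width, again using the built-in iris data. The two distances add up to the IQR, but only Q1 and Q3 reveal how that spread is split around the median. The variable names are illustrative.

```r
# Setosa petal width from the built-in iris data set
x <- iris$Petal.Width[iris$Species == "setosa"]

# Q1, median, Q3 with the default quantile type
q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)

lower_half <- q[2] - q[1]   # median - Q1
upper_half <- q[3] - q[2]   # Q3 - median
iqr        <- q[3] - q[1]   # total spread of the middle 50%

# Unequal halves hint at asymmetry; the IQR alone cannot show this
c(lower_half = lower_half, upper_half = upper_half, IQR = iqr)
```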

The visual aspect is always on my mind when deciding what to include; the amount of space it takes, but also the way it integrates with the rest; for instance, including a fivenum would make sense, theoretically speaking, but it would look quite odd.

dcomtois commented 3 years ago

Fixed & added optional additional stats to show -- example for Q1 & Q3 provided here: https://raw.githubusercontent.com/dcomtois/summarytools/master/doc/Custom-Statistics-in-dfSummary.pdf