dcomtois / summarytools

R Package to Quickly and Neatly Summarize Data

descr does not calculate statistics (e.g. min, max) correctly if the column names contain exactly the same postfixes as the statistics function string (e.g. "column_min" or "column_max") #152

Open yenchiayi opened 3 years ago

yenchiayi commented 3 years ago

I have a small data.frame with 2 rows and 3 columns:

column0  column1  column2
      1       11       21
      2       12       22

The descr function calculates everything correctly when the column names are c("x", "x_1", "x_2"):

library(magrittr)  # provides the %>% pipe

df <- data.frame(
  x = 1:2,
  x_1 = 11:12,
  x_2 = 21:22
)
df %>%
  summarytools::descr(stats = c("min", "max", "n.valid", "skewness", "kurtosis"))
               x     x_1     x_2
Min         1.00   11.00   21.00
Max         2.00   12.00   22.00
N.Valid     2.00    2.00    2.00
Skewness    0.00    0.00    0.00
Kurtosis   -2.75   -2.75   -2.75

However, when the column names are c("x", "x_min", "x_max"), descr no longer calculates the minimum and maximum correctly, and the other statistics ("n.valid", "skewness", and "kurtosis") are wrong as well.

df <- data.frame(
  x = 1:2,
  x_min = 11:12,
  x_max = 21:22
)
df %>%
  summarytools::descr(stats = c("min", "max", "n.valid", "skewness", "kurtosis"))

As the output below shows, the reported Min of x_max (21) is larger than its Max (2). N.Valid, Skewness, and Kurtosis are also wrong for both x_max and x_min: N.Valid is 1 instead of 2, and Skewness and Kurtosis are NA.

               x   x_max   x_min
Min         1.00      21       1
Max         2.00       2       1
N.Valid     2.00       1       1
Skewness    0.00      NA      NA
Kurtosis   -2.75      NA      NA
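
For comparison, the expected Min, Max, and N.Valid values can be cross-checked with base R alone, without going through summarytools. This is only a minimal sketch: the name `expected` is chosen here for illustration, and skewness and kurtosis are left out to keep the check simple.

```r
# Same data as in the failing example
df <- data.frame(
  x = 1:2,
  x_min = 11:12,
  x_max = 21:22
)

# Compute each statistic per column with base R only
expected <- rbind(
  Min     = sapply(df, min),
  Max     = sapply(df, max),
  N.Valid = sapply(df, function(col) sum(!is.na(col)))
)

print(expected)
```

Each column here has two non-missing values, and its minimum is below its maximum, which is what descr should report regardless of how the columns are named.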

My preliminary guess is that the program fails to distinguish a column name suffix (e.g. the "min" in x_min) from the statistic name (e.g. min). The issue seems to arise around lines 367-373 in descr.R. Could you check what happens there?
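
To illustrate the kind of collision I suspect, here is a hypothetical sketch in base R. It is not the actual code from descr.R, and I have not confirmed that descr does anything like this. It only shows how matching statistic names as substrings can pick up column names that end in the same string, while exact matching does not. The names `col_names`, `stat_names`, `substring_hits`, and `exact_hits` are made up for this example.

```r
col_names  <- c("x", "x_min", "x_max")
stat_names <- c("min", "max")

# Substring matching: the pattern "min" also matches the column "x_min",
# and "max" also matches "x_max"
substring_hits <- lapply(stat_names, function(s) grep(s, col_names, value = TRUE))
names(substring_hits) <- stat_names

# Exact matching: no column name is identical to a statistic name,
# so nothing is mistaken for a statistic
exact_hits <- col_names[col_names %in% stat_names]

print(substring_hits)
print(exact_hits)
```

If something along these lines is happening, matching on full names (or anchoring the pattern) rather than on substrings would avoid the clash.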


Thanks!