IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

What should our default be in calculation system for `.preserve` in `dplyr::filter`? #5352

Open dannyparsons opened 5 years ago

dannyparsons commented 5 years ago

dplyr::filter has gained a .preserve argument which can preserve original groupings after filtering.

Example:

library(dplyr)

df <- data.frame(
  x = c(1, 4, 6, 7, 1, 2),
  y = c(0, 0, 1, 4, 3, 1),
  year = c(1999, 1999, 2000, 2000, 2001, 2001)
)
# Default (.preserve = FALSE): 1999 has no rows left after the filter, so its group is dropped
df %>% group_by(year) %>% filter(y > 0) %>% summarise(sum_x = sum(x))
# # A tibble: 2 x 2
#    year sum_x
#   <dbl> <dbl>
# 1  2000    13
# 2  2001     3

Or, with .preserve = TRUE, we keep all the years in the summary:

df <- data.frame(
  x = c(1,4,6,7,1,2), 
  y = c(0, 0, 1, 4, 3, 1),
  year = c(1999, 1999, 2000, 2000, 2001, 2001)
)
df %>% group_by(year) %>% filter(y > 0, .preserve = TRUE) %>% summarise(sum_x = sum(x))
# # A tibble: 3 x 2
#    year sum_x
#   <dbl> <dbl>
# 1  1999     0
# 2  2000    13
# 3  2001     3

.preserve = TRUE solves our issues in climatic summaries like start of rains, where the filter can remove every day in a year but we would still like to report that year's start of rains as NA.
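A minimal sketch of the start-of-rains case, not the actual R-Instat calculation: the data frame `daily`, its columns, and the 20 mm threshold are hypothetical, chosen only to show an empty group turning into NA in the summary.

```r
library(dplyr)

# Hypothetical toy data: three days in each of two years.
# No day in 2001 reaches the (assumed) 20 mm threshold.
daily <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  doy  = c(100, 101, 102, 100, 101, 102),
  rain = c(0, 25, 30, 0, 5, 2)
)

start_rains <- daily %>%
  group_by(year) %>%
  filter(rain >= 20, .preserve = TRUE) %>%  # 2001 is kept as an empty group
  summarise(start_doy = doy[1])             # indexing an empty vector gives NA

start_rains
```

With the default .preserve = FALSE, 2001 would be missing from the result altogether rather than reported as NA.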

But should this be the default for our calculation system? You may not want this, for example, if you are also filtering on the years to get fewer rows in the summary table.
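To illustrate that downside, here is a small sketch using the same `df` as above, this time filtering on year itself. The object names `dropped` and `kept` are only for illustration.

```r
library(dplyr)

df <- data.frame(
  x = c(1, 4, 6, 7, 1, 2),
  y = c(0, 0, 1, 4, 3, 1),
  year = c(1999, 1999, 2000, 2000, 2001, 2001)
)

# Default: filtering on year removes 1999 from the summary table
dropped <- df %>%
  group_by(year) %>%
  filter(year > 1999) %>%
  summarise(sum_x = sum(x))

# .preserve = TRUE: 1999 stays as an empty group, so it still gets a row
kept <- df %>%
  group_by(year) %>%
  filter(year > 1999, .preserve = TRUE) %>%
  summarise(sum_x = sum(x))

nrow(dropped)
nrow(kept)
```

Here `kept` still has a row for 1999 (with a sum of 0 over no values), which is probably not what a user filtering on year intends.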

lilyclements commented 1 month ago

I just came across this, and I think it is timely since we've been updating the calculation system to support .drop and .preserve. So far I've assumed we keep .drop = TRUE and .preserve = FALSE by default, because that is how R-Instat worked before these parameters were created. However, I think it would be good to discuss this.

@rdstern do you have any strong thoughts on this?

rdstern commented 1 month ago

@lilyclements good question. Originally the dialog was contradictory, in that it did .drop=TRUE, but the disabled control indicated otherwise. I prefer to keep it that way. My reasoning is partly because I am very happy that our summarise works fine, even when the by variables are not factors. (Users suffer with Genstat, which is much stricter concerning factor variables.) We really simplified tutorial 2 when Danny realised we don't have to bother users to make year a factor for some operations, while considering it as numeric for others. So I was a bit concerned that we had to bother about that again with the start of the rains stuff - I don't think we do now. So our default gives the same result whether the by variables are factors, dates, or numeric. .drop=FALSE is a bonus that is available to us when all the by variables are factors!
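A rough sketch of that bonus, assuming a reasonably recent dplyr and reusing the toy data from the top of the thread (the object names are only for illustration). With year as a factor and .drop = FALSE in group_by, the empty 1999 level should survive the filter even with the default .preserve = FALSE:

```r
library(dplyr)

df <- data.frame(
  x = c(1, 4, 6, 7, 1, 2),
  y = c(0, 0, 1, 4, 3, 1),
  year = c(1999, 1999, 2000, 2000, 2001, 2001)
)

# Default behaviour (.drop = TRUE, .preserve = FALSE), numeric year
default_summary <- df %>%
  group_by(year) %>%
  filter(y > 0) %>%
  summarise(sum_x = sum(x))

# year as a factor with .drop = FALSE: empty factor levels are kept as groups
factor_summary <- df %>%
  mutate(year = factor(year)) %>%
  group_by(year, .drop = FALSE) %>%
  filter(y > 0) %>%
  summarise(sum_x = sum(x))

nrow(default_summary)
nrow(factor_summary)
```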

Hope you agree. If so, then @rachelkg I think we could make that point in the help?