duttashi / learnr

How to replace multiple summarize statements by a custom function? #46

Closed: duttashi closed this issue 5 years ago

duttashi commented 5 years ago

This question was originally asked on Stack Overflow (SO). It is reproduced here for reference only.

A minimal example:

library(tidyverse)
col1 <- c("UK", "US", "UK", "US")
col2 <- c("Tech", "Social", "Social", "Tech")
col3 <- c("0-5years", "6-10years", "0-5years", "0-5years")
col4 <- 1:4
col5 <- 5:8

df <- data.frame(col1, col2, col3, col4, col5)

result1 <- df %>% 
  group_by(col1, col2) %>% 
  summarize(sum1 = sum(col4, col5))

result2 <- df %>% 
  group_by(col2, col3) %>% 
  summarize(sum1 = sum(col4, col5))

result3 <- df %>% 
  group_by(col1, col3) %>% 
  summarize(sum1 = sum(col4, col5))
duttashi commented 5 years ago

Possible solutions:

1: Using combn(), which ships with R (in the utils package)

combn(colnames(df)[1:3], 2, FUN = function(x){
  df %>% 
    group_by(.dots = x) %>% 
    summarize(sum1 = sum(col4, col5))
  }, simplify = FALSE)

[[1]]
# A tibble: 4 x 3
# Groups:   col1 [2]
  col1  col2    sum1
  <fct> <fct>  <int>
1 UK    Social    10
2 UK    Tech       6
3 US    Social     8
4 US    Tech      12

[[2]]
# A tibble: 3 x 3
# Groups:   col1 [2]
  col1  col3       sum1
  <fct> <fct>     <int>
1 UK    0-5years     16
2 US    0-5years     12
3 US    6-10years     8

[[3]]
# A tibble: 3 x 3
# Groups:   col2 [2]
  col2   col3       sum1
  <fct>  <fct>     <int>
1 Social 0-5years     10
2 Social 6-10years     8
3 Tech   0-5years     18
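
A note on the combn() approach above: as far as I know, the .dots argument of group_by() is deprecated in newer dplyr releases (1.0.0 and later, I believe). A minimal sketch of the same idea using across(all_of()) is below. It assumes dplyr >= 1.0.0; the names group_pairs and results are only illustrative and not part of the original answer.

```r
library(dplyr)

# Same toy data as in the question
df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# All pairs of grouping columns, as a list of character vectors
group_pairs <- combn(c("col1", "col2", "col3"), 2, simplify = FALSE)

# Name each list element after its pair, e.g. "col1_col2"
names(group_pairs) <- vapply(group_pairs, paste, character(1), collapse = "_")

# One summary per pair; across(all_of(cols)) selects the grouping
# columns from a character vector (assumes dplyr >= 1.0.0)
results <- lapply(group_pairs, function(cols) {
  df %>%
    group_by(across(all_of(cols))) %>%
    summarise(sum1 = sum(col4, col5), .groups = "drop")
})

results$col1_col2
```

Naming the list elements makes it easier to pick out a single result later (results$col2_col3, for example) instead of relying on the position in the list.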

2: Using a custom function

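To use dplyr verbs inside your own functions, you need tidy evaluation. dplyr relies on non-standard evaluation: column names such as col4 are captured as expressions and looked up in the data frame instead of being evaluated as ordinary R variables. A column name passed as a function argument therefore has to be quoted first and unquoted inside the dplyr call. I recommend reading this: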

https://tidyeval.tidyverse.org/modifying-inputs.html#modifying-quoted-expressions

summarizefunction <- function(data, ..., sumvar1, sumvar2) {

    groups <- enquos(...)
    sumvar1 <- enquo(sumvar1)
    sumvar2 <- enquo(sumvar2)

    result <- data %>%
        group_by(!!!groups) %>%
        summarise(sum1 = sum(!!sumvar1, !!sumvar2))
    return(result)
}

summarizefunction(df, col1, col2, sumvar1 = col4, sumvar2 = col5)

The enquo() function (and enquos() for the ... arguments) captures an argument as a quoted expression, which prevents it from being evaluated immediately. The !! operator (called "bang bang") then unquotes a single argument inside the dplyr call, and !!! splices in the whole list of grouping variables. I think this is the most flexible and reusable solution, even though it takes a bit more code up front.
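
For what it's worth, newer versions of rlang (0.4.0 and later, as far as I know) offer the {{ }} ("curly-curly") operator, which quotes and unquotes in one step. A minimal sketch of the same function written that way is below. The name summarize_pair is hypothetical, and the .groups argument assumes dplyr >= 1.0.0.

```r
library(dplyr)

# Same toy data as in the question
df <- data.frame(
  col1 = c("UK", "US", "UK", "US"),
  col2 = c("Tech", "Social", "Social", "Tech"),
  col3 = c("0-5years", "6-10years", "0-5years", "0-5years"),
  col4 = 1:4,
  col5 = 5:8
)

# Hypothetical helper: grouping columns go in ..., the two columns
# to sum are passed as bare names and embraced with {{ }}
summarize_pair <- function(data, ..., sumvar1, sumvar2) {
  data %>%
    group_by(...) %>%
    summarise(sum1 = sum({{ sumvar1 }}, {{ sumvar2 }}), .groups = "drop")
}

res <- summarize_pair(df, col1, col2, sumvar1 = col4, sumvar2 = col5)
res
```

Here the dots are forwarded to group_by() directly, so no enquos() or !!! is needed. The enquo()/!! form is still useful when the captured expression has to be inspected or modified before it is used.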