DS4PS / cpp-528-spr-2020

Course shell for CPP 528 Foundations of Data Science III for Spring 2020.
http://ds4ps.org/cpp-528-spr-2020/

Dynamic filtering with column names as strings #20

Open cenuno opened 4 years ago

cenuno commented 4 years ago

How do you create a function that is flexible with respect to column names when using the dplyr::filter() function?

The comparison logic inside the function stays the same; the only things that change are the column name and the value used in the comparison.

cenuno commented 4 years ago

Overview

This requires us to be aware of the get() function, which looks up an R object by its name (given as a string) rather than by its value. When you call dplyr::filter() directly, you type column names without quotes. However, now that you want the user to supply the column of interest as a string, we need get() to turn that string into the column itself so the filtering can be done dynamically.
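As a minimal sketch of what get() does on its own, the base R snippet below looks up an object from a string and then does the same for a column of the built-in iris data. The variable names sepal_width_cutoff, column_name, and widths are made up for illustration; with() is used here only as a stand-in for the way dplyr::filter() makes columns visible by name.

```r
# Base R only: get() looks an object up by the name stored in a string.
sepal_width_cutoff <- 2            # hypothetical variable, for illustration
get("sepal_width_cutoff")          # same as typing sepal_width_cutoff

# Inside with(), the columns of iris are visible as objects, so get()
# can find a column from its name. dplyr::filter() exposes columns in a
# similar way, which is why get(column_name) works there too.
column_name <- "Sepal.Width"
widths <- with(iris, get(column_name))

# The looked-up column is the same vector as iris$Sepal.Width.
identical(widths, iris$Sepal.Width)  # TRUE
```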

Sample code

Let's assume the filter_df() function down below exists in the analysis/functions/utlities.R file. Notice that the function contains high-level documentation that tells the reader three things: what the function does, which arguments it expects, and what it returns.

The more descriptive you can be about object types in your arguments section (e.g. df = data frame) and in your output, the easier it is for others to understand what your code should be doing. This is very similar to pseudocode in that it forces you to express what the logic should do (and if the logic isn't working, you can share your code and let others offer help).

Here, filter_df() will return a data frame whose column_name values are exactly equal to value:

# create function ----
filter_df <- function(df, value, column_name) {
  # Return a df whose `column_name` values are exactly equal to `value`
  #
  # Arguments
  #   - df:           a data frame
  #   - value:        a number
  #   - column_name:  a character string naming a column in df
  #
  # Return
  #   - a data frame

  # filter the df, keeping only the records whose value in the given
  # column_name is exactly equal to value.
  tmp_df <- dplyr::filter(df, 
                          # note: the use of get() refers to
                          #       R objects by name rather than value.
                          get(column_name) == value)

  # return the filtered data frame to the caller
  return(tmp_df)
}

The use of get() inside dplyr::filter() allows for this type of dynamic filtering.

Now that we have our custom function, let's check it out (for more on source(), see #19):

# load necessary packages
library(dplyr)
library(here)

# load necessary functions
source(here("analysis/functions/utlities.R"))

# test the function
filter_df(iris, column_name = "Sepal.Width", value = 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
           5           2          3.5           1 versicolor

Let's test the function again, this time using Petal.Width rather than Sepal.Width in the column_name argument:

filter_df(iris, column_name = "Petal.Width", value = 2)
Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
         6.5         3.2          5.1           2 virginica
         5.7         2.5          5.0           2 virginica
         5.6         2.8          4.9           2 virginica
         7.7         2.8          6.7           2 virginica
         7.9         3.8          6.4           2 virginica
         6.5         3.0          5.2           2 virginica
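One quick way to sanity-check the two results is to count the matching rows with plain base R subsetting, which also accepts the column name as a string. This is only a cross-check sketch; n_sepal and n_petal are illustrative names, not part of the course code.

```r
# Count matching rows with base R (no dplyr needed).
# [[ ]] accepts the column name as a string, just like get() does above.
n_sepal <- sum(iris[["Sepal.Width"]] == 2)  # rows where Sepal.Width is 2
n_petal <- sum(iris[["Petal.Width"]] == 2)  # rows where Petal.Width is 2

# One row and six rows, matching the two filter_df() outputs.
c(n_sepal, n_petal)
```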

Conclusion

Notice that regardless of whether I used column_name = "Sepal.Width" or column_name = "Petal.Width", the filter_df() function was able to take the string the user passed to column_name and turn it into the corresponding column via get(column_name). The get() function ensures the string is treated as the name of an object rather than as a literal string, which is what dplyr::filter() needs in order to compare that column's values against value.
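For comparison, dplyr also provides the .data pronoun for referring to a column by a string. The sketch below is a hypothetical variant (filter_df_pronoun() is a made-up name, not part of the course code) and assumes the dplyr package is installed; it is meant as an alternative to consider, not a replacement for the get() approach shown above.

```r
# Hypothetical variant of filter_df() that uses the .data pronoun
# instead of get(). Assumes the dplyr package is installed.
filter_df_pronoun <- function(df, value, column_name) {
  # .data[[column_name]] looks only at the columns of df, so a
  # misspelled column name raises an error rather than picking up
  # an unrelated object that happens to have the same name.
  dplyr::filter(df, .data[[column_name]] == value)
}

result <- filter_df_pronoun(iris, value = 2, column_name = "Petal.Width")
nrow(result)  # 6, the same rows as the get() version
```

The main design difference is scope: get() searches the data frame's columns first and then the surrounding environments, while .data[[...]] is restricted to the data frame itself.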

sunaynagoel commented 4 years ago

Thank you @cenuno