alan-y / blogdown-website

This is my personal website and blog built using the blogdown R package and deployed with Netlify.
https://alan-y.netlify.app

Filtering with string statements in dplyr | Alan Yeung #2

utterances-bot opened 4 years ago

utterances-bot commented 4 years ago

Filtering with string statements in dplyr | Alan Yeung

A question came up recently at work about how to use a filter statement entered as a complete string variable inside dplyr’s filter() function – for example dplyr::filter(my_data, "var1 == 'a'"). There does not seem to be much out there on this and I was not sure how to do it either, but luckily jakeybob had a neat solution that seems to work well:

some_data %>% filter(eval(rlang::parse_expr(selection_statement)))

Let’s see it in action using the iris flowers dataset.

https://alan-y.netlify.app/post/filtering-string-dplyr/
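For reference, a minimal runnable version of the idiom from the post (the iris2 name is an assumption here, matching the later comments where iris is converted to a tibble):

library(dplyr)

iris2 <- as_tibble(iris)
selection_statement <- "Species == 'setosa'"

# parse the string into an unevaluated expression, then evaluate it
# inside filter(), where the iris2 columns are in scope
iris2 %>% 
  filter(eval(rlang::parse_expr(selection_statement)))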

Nagaraj-Pudukotai commented 4 years ago

The reason this works is what is known as non-standard evaluation (NSE) in R; the tidyverse exploits this feature very well. To understand more about it, search for non-standard evaluation and read any of the articles on the topic from RStudio or Hadley Wickham.
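As a rough sketch of the idea (not part of the original comment): dplyr verbs capture their arguments as unevaluated expressions and only evaluate them with the data frame columns in scope, which is also what the eval()/parse_expr() trick leans on.

library(rlang)

# quote() captures code without running it, much like filter() does internally
expr <- quote(Species == "setosa")

# eval_tidy() evaluates it with the iris columns in scope,
# which is essentially what happens inside filter()
table(eval_tidy(expr, data = iris))
#> FALSE  TRUE 
#>   100    50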

rokasklydzia commented 4 years ago

I wonder if this expression would work inside select() or pivot_longer() statements...

lionel- commented 4 years ago

This is a common idiom among data.table users. I would use eval() instead of eval_tidy() because quosures won't work properly with this approach.

The expressions in your example should work in most cases because they only involve data frame columns and syntactic literals, which are valid in any environment and don't risk being masked by data frame columns. This would be more tricky if the expressions involved variables from the environment.
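A hypothetical illustration of that trickiness (the df and threshold names below are made up for the example): if the string refers to an environment variable and the data frame happens to have a column of the same name, the column silently wins.

library(dplyr)

threshold <- 3
df <- tibble(x = 1:5, threshold = 100)

# the parsed expression picks up the `threshold` column (100),
# not the environment variable (3), so no rows match
df %>% filter(eval(rlang::parse_expr("x > threshold")))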

The main problem with building expressions with strings is that you need to worry about special characters, escaping, etc. I would use this approach instead:

var <- "Species"
value <- "setosa"

iris2 %>% filter(.data[[var]] == .env$value)
#> # A tibble: 50 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>
#> 1          5.1         3.5          1.4         0.2 setosa
#> 2          4.9         3            1.4         0.2 setosa
#> 3          4.7         3.2          1.3         0.2 setosa
#> 4          4.6         3.1          1.5         0.2 setosa
#> # … with 46 more rows

You can generalise it for different comparison operations:

var <- "Sepal.Width"
value <- 4
cmp <- `>`

iris2 %>% filter(cmp(.data[[var]], .env$value))
#> # A tibble: 3 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>
#> 1          5.7         4.4          1.5         0.4 setosa
#> 2          5.2         4.1          1.5         0.1 setosa
#> 3          5.5         4.2          1.4         0.2 setosa

In a function:

my_filter <- function(data, var, cmp, value) {
  data %>% filter(cmp(.data[[var]], .env$value))
}

iris2 %>% my_filter("Species", `==`, "setosa")
#> # A tibble: 50 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>
#> 1          5.1         3.5          1.4         0.2 setosa
#> 2          4.9         3            1.4         0.2 setosa
#> 3          4.7         3.2          1.3         0.2 setosa
#> 4          4.6         3.1          1.5         0.2 setosa
#> # … with 46 more rows

glsnow commented 4 years ago

I am starting with an analogy, so stick with me, it will get around to R and parse.

My employer requires all receipts for expense reimbursement to be submitted in a pdf file (I expect most employers are similar these days). At one point the official instructions for an electronic receipt were to print out the receipt on paper, then scan that paper to get the pdf file. I instead found tools that let me print directly to a pdf file and combine multiple pdf files into one. This saved some paper, some trips to the printer/scanner, and resulted in better quality.

Using parse type functions in R is much like printing something from an electronic file, then scanning it into an electronic file. It technically works, but it is the long way around in most cases (there are exceptions if your name is Hadley and you really know what you are doing along with the possible dangers and how to deal with them).

If the string to use in filtering will come from someone else (e.g. a text field in a shiny page) then you need to be very careful: someone could put any code in there, including calls to system or related functions, that could give them access to information that you don't want public, or even damage files on your system.
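A hypothetical illustration of that danger (the braced statement and the message() call below are invented for the example): anything in the string gets executed, not just the comparison, and the legitimate-looking result can hide the side effect.

library(dplyr)

# the string smuggles an extra call in alongside the "real" condition;
# evaluating it leaks the home directory while still returning setosa rows
selection_statement <- "{ message(Sys.getenv('HOME')); Species == 'setosa' }"

iris %>% 
  filter(eval(rlang::parse_expr(selection_statement)))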

If you are creating the string yourself, then you are taking information, putting it into a string, then parsing the string to get back to the same information (like printing then scanning). It will be more efficient in the long run (as well as safer and easier to debug) to work with the original information instead of putting it into a string and parsing. You will probably need to use something other than filter.

See fortune(106) and fortune(181) from the fortunes package.

alan-y commented 4 years ago

Thanks very much for these explanations! @lionel- I had a feeling that something like what you've suggested would be the way to go with this but didn't know how it could be done. @glsnow I like the analogy, and you've definitely made clear some of the dangers of using eval() and parse(). I also checked out those fortunes, thanks!

markolipka commented 3 years ago

It seems like this fails if the selection_statement is an empty string. How could one overcome this? Thanks!

lukr90 commented 3 years ago

Has anyone managed to get the evaluation to work on lazy tibbles as well? Like if iris was in a database: tbl(con, 'iris') %>% filter(...). I couldn't get this to work on a first try unfortunately.

alan-y commented 3 years ago

@markolipka - if selection_statement is an empty string, you could get it to work by wrapping rlang::parse_expr() in a small helper that deals with that special case, e.g.

library(tidyverse)
iris2 <- as_tibble(iris)

selection_statement <- ""

parse_expr2 <- function(x) {
  if (x != "") {
    rlang::parse_expr(x)
  } else {
    # an empty statement should filter nothing, so return TRUE to keep every row
    return(TRUE)
  }
}

iris2 %>% 
  filter(eval(parse_expr2(selection_statement)))

selection_statement <- "Species == 'setosa'"

iris2 %>% 
  filter(eval(parse_expr2(selection_statement)))

alan-y commented 3 years ago

@lukr90 - if iris is in a database, you can get this to work by not running eval(), since dbplyr evaluates the expression later when the query behind the lazy tibble is run, e.g.

library(tidyverse)
library(RSQLite)
iris_db <-  dbplyr::tbl_memdb(iris)

selection_statement <- "Species == 'setosa'"

iris_db %>% 
  filter(rlang::parse_expr(selection_statement)) %>% 
  count(Species)
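
To check what the parsed filter actually turns into on the database side, one could inspect the generated SQL (a sketch, reusing the same in-memory table from above):

iris_db %>% 
  filter(rlang::parse_expr(selection_statement)) %>% 
  show_query()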

markolipka commented 3 years ago

Great, thanks!

aweyant commented 2 years ago

Hello,

Thank you for sharing this solution. It is necessary for the work I am doing and I have used it before.

I just noticed that this method fails if there is a comma in the string we feed as an argument.

For example...

quantile_filter <- "year %in% (1990:2019), month %in% (5:10)"

ghcnd_stations_df %>%
  filter(eval(rlang::parse_expr(quantile_filter)))

yields the following error:

Error in `filter()`:
! Problem while computing `..1 = eval(rlang::parse_expr(quantile_filter))`.
Caused by error in `parse()`:
! <text>:1:22: unexpected ','
1: year %in% (1990:2019),

This is fixed by using the equivalent (and more proper) specification of the filter conditions with "&" instead of a comma. I have taken to the habit of using commas because, in the past, using "&" has led to bugs with no explanation. I did not care to explore these bugs, as I quickly fixed them by using a comma and moved on.
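For example, the working form of the same filter would be something like:

quantile_filter <- "year %in% (1990:2019) & month %in% (5:10)"

ghcnd_stations_df %>%
  filter(eval(rlang::parse_expr(quantile_filter)))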

This is a curious special case which may become more important if I release my working code as a package.

alan-y commented 2 years ago

Thanks @aweyant, yeah it doesn't work with the comma as you say. But in case it's of interest, there is another way to get this to work, with the unquote-splice operator !!! and rlang::parse_exprs():

library(tidyverse)

iris2 <- as_tibble(iris)
selection_statement <- "Species == 'setosa'; Sepal.Length > 5"

iris2 %>% 
  filter(!!!rlang::parse_exprs(selection_statement))
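
rlang::parse_exprs() splits the string on semicolons and newlines into a list of expressions, and !!! splices them into filter() as separate conditions (which filter() combines with &). Applied to the earlier example, that would mean swapping the commas for semicolons, e.g. (a sketch, not run against the original data):

quantile_filter <- "year %in% (1990:2019); month %in% (5:10)"

ghcnd_stations_df %>%
  filter(!!!rlang::parse_exprs(quantile_filter))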