DyfanJones / noctua

Connect R to Athena using paws SDK (DBI Interface)
https://dyfanjones.github.io/noctua/
Other
45 stars 5 forks source link

grepl support for use with dbplyr #149

Closed ellmanj closed 2 years ago

ellmanj commented 2 years ago

There seems to be no implementation of grepl in the sql_translator for use with dbplyr.

For example:

> dplyr::tbl(db_connection, "concept") %>% filter(grepl('search_str', 'concept_name')) %>% show_query()
<SQL>
SELECT *
FROM "concept"
WHERE (grepl('search_str', "concept_name"))

Instead I expected the output to be something like this:

<SQL>
SELECT *
FROM "concept"
WHERE concept_name like '%search_str%'

Other backends for dbplyr include support for grepl, e.g. here is the postgres implementation.

DyfanJones commented 2 years ago

Ah sorry about that, I will take a look at it this weekend :)

ellmanj commented 2 years ago

Thank you for the quick response! Let me know if there is anything I can do to help.

DyfanJones commented 2 years ago

Athena can utilise these Pestro Regular Expression Functions https://prestodb.io/docs/current/functions/regexp.html. I wonder if it would beneficial to map them to R equivalents. 🤔

DyfanJones commented 2 years ago

Presto has a nice function regexp_like which seems to be fairly close to R's grepl and regexpr.

 regexp_like(string, pattern) → boolean#

    Evaluates the regular expression pattern and determines if it is contained within string.

    This function is similar to the LIKE operator, except that the pattern only needs to be contained within string, rather than needing to match all of string. In other words, this performs a contains operation rather than a match operation. You can match the entire string by anchoring the pattern using ^ and $:

I wonder if grepl should be mapped to like and regexr should be mapped to regexp_like. The pattern utilised is very much what you would used when using grepl in R.

ellmanj commented 2 years ago

Hmm, yeah seems like we could use regexp_like here.

In my current project I don't have a use-case for regexpr, and I don't see an implementation for it in other dbplyr backends such as Postgres. I wonder if that's because it's not as applicable in the context of SQL queries since it returns information about the location of the matching text and it's length. Whereas grepl just returned a boolean True/False on whether or not a match exists.

With regard to implementation for grepl, I wonder the best way to implement the ignore.case option. This post seems to show two potential ways, one using lower() and LIKE, and one using regexp_like(). I wonder if there's a difference in terms of performance...

DyfanJones commented 2 years ago

hi @ellmanj,

I have an initial implementation. I have also thrown in support for stringr and lubridate functions:

remotes::install_github("DyfanJones/noctua", ref = "sql_translate")
library(DBI)
library(dplyr)

con <- dbConnect(noctua::athena())

demo_iris <- iris
iris$date <- Sys.Date()

dbWriteTable(con, "demo_iris", iris)

demo_iris <- tbl(con, "demo_iris")

# return nothing
demo_iris %>%
  filter(grepl("Set", species))

demo_iris %>%
  filter(grepl("Set", species, ignore.case = T))
# Source:   lazy query [?? x 6]
# Database: Athena 0.1.11 [default@eu-west-1/default]
# sepal_length sepal_width petal_length petal_width species date      
# <dbl>       <dbl>        <dbl>       <dbl> <chr>   <date>    
# 1          5.1         3.5          1.4         0.2 setosa  2021-07-23
# 2          4.9         3            1.4         0.2 setosa  2021-07-23
# 3          4.7         3.2          1.3         0.2 setosa  2021-07-23
# 4          4.6         3.1          1.5         0.2 setosa  2021-07-23

# stringr example
demo_iris %>%
  mutate(str_loc = str_locate(species, "o"))
# Source:   lazy query [?? x 7]
# # Database: Athena 0.1.11 [default@eu-west-1/default]
# sepal_length sepal_width petal_length petal_width species date       str_loc
# <dbl>       <dbl>        <dbl>       <dbl> <chr>   <date>     <int64>
# 1          5.1         3.5          1.4         0.2 setosa  2021-07-23       4
# 2          4.9         3            1.4         0.2 setosa  2021-07-23       4
# 3          4.7         3.2          1.3         0.2 setosa  2021-07-23       4
# 4          4.6         3.1          1.5         0.2 setosa  2021-07-23       4

# lubridate example
demo_iris %>%
  mutate(day_of_week = wday(date, label = T))
# Source:   lazy query [?? x 7]
# Database: Athena 0.1.11 [default@eu-west-1/default]
# sepal_length sepal_width petal_length petal_width species date       day_of_week
# <dbl>       <dbl>        <dbl>       <dbl> <chr>   <date>     <chr>      
# 1          5.1         3.5          1.4         0.2 setosa  2021-07-23 Fri        
# 2          4.9         3            1.4         0.2 setosa  2021-07-23 Fri        
# 3          4.7         3.2          1.3         0.2 setosa  2021-07-23 Fri

dbRemoveTable(con, "demo_iris")

Please try it out and let me know.

DyfanJones commented 2 years ago

Hmm, yeah seems like we could use regexp_like here.

In my current project I don't have a use-case for regexpr, and I don't see an implementation for it in other dbplyr backends such as Postgres. I wonder if that's because it's not as applicable in the context of SQL queries since it returns information about the location of the matching text and it's length. Whereas grepl just returned a boolean True/False on whether or not a match exists.

With regard to implementation for grepl, I wonder the best way to implement the ignore.case option. This post seems to show two potential ways, one using lower() and LIKE, and one using regexp_like(). I wonder if there's a difference in terms of performance...

I'm not a 100% sure around performance. Might need to do a little digging. However as a side note you can use sql like in dplyr using the following:

demo_iris %>%
  filter(species %like% "%set%")

hope that gives another method you can use (%like% is already supported if you wish to use it now 😄 )

ellmanj commented 2 years ago

Please try it out and let me know.

Wow, thanks for getting on this so fast. I tried it out and it looks good to me!

DyfanJones commented 2 years ago

@ellmanj PR #150 has now been merged to master. I will update RAthena and then push the updates to the cran. In the mean time if you need to the fix please install github version:

remotes::install_github("DyfanJones/noctua")
DyfanJones commented 2 years ago

@ellmanj noctua and RAthena has been release to the cran. It should be accessible from the cran in the next couple of days