WinVector / wrapr

Wrap R for Sweet R Code
https://winvector.github.io/wrapr/

How can you use dplyr::case_when programmatically with wrapr::let? #8

Closed rkingdc closed 5 years ago

rkingdc commented 5 years ago

I want to be able to use dplyr::case_when to dynamically cut a database column, similar to how base::cut might work. I can generate a function to do this with rlang (below), but I find wrapr::let much more readable. How would a similar approach be done with let, both constructing the case_expr list and passing that list as the argument to case_when?

library(DBI)
library(RSQLite)

cut_column_from_vector <- function(column_name, cut_vector){
    # get names in various formats 
    new_column_name   <- paste0(column_name, '_filter__')
    s_column_name     <- rlang::sym(column_name)
    s_new_column_name <- rlang::sym(new_column_name)

    # the vector shouldn't have names, but if it has them, use those names instead of the
    # canned ones then NULL out the names
    if (!is.null(names(cut_vector))){
        cut_names <- names(cut_vector)
        cut_vector <- unname(cut_vector)
    } else {
        cut_names <- cut_vector
    }

    # construct the object case_when needs to work 
    case_expr <- lapply(c(0, seq_along(cut_vector)), function(i){
        if (i == 0){
            lab <- sprintf('x<=%s', cut_names[i+1]) # a label
            rlang::expr(!!s_column_name <= cut_vector[!!i+1] ~ !!lab) # the expression
        } else if (i == length(cut_vector)) {
            lab <- sprintf('x>%s', cut_names[i])
            rlang::expr(!!s_column_name > cut_vector[!!i] ~ !!lab)
        } else {
            lab <- sprintf('%s<x<=%s', cut_names[i], cut_names[i+1])
            rlang::expr(!!s_column_name > cut_vector[!!i] & !!s_column_name <= cut_vector[!!i+1] ~ !!lab)
        }
    })

    # return the function
    return(function(data){
        dplyr::mutate(data, !!s_new_column_name := dplyr::case_when(!!!case_expr))
    })
}

# reprex
db <- dbConnect(SQLite(), ':memory:')
dbWriteTable(db, 'tbl_mtcars',  mtcars)
tbl_mtcars <- dplyr::tbl(db, 'tbl_mtcars')

cut_fn <- cut_column_from_vector('hp', c(100,200,300))
cut_fn(tbl_mtcars) # creates column hp_filter__

Apologies for the strange approach to generating a function that operates on the whole table--it makes sense in the context of the project.

JohnMount commented 5 years ago

Thanks for the note, I am glad you like wrapr::let().

The core point is wrapr::let() and rlang::!! are just tools for working around APIs that capture values by NSE instead of taking a value. rlang::!!! works around the additional API flaw of taking a series of values in ... instead of accepting a list.

Building up a large expression is one of the few places where the above issues occur naturally. Most of the other places one hits such issues could have been avoided with better design. Our advice is to design package APIs to minimize the above flaws, though introducing such flaws and then being seen to fix them is somehow portrayed as exciting progress. case_when() is dplyr's baby, so only packages maintained by the same group will be able to reliably lash into its internals.

The concrete issue with your cut example is: your solution is building up an expression of undetermined size from pieces in an lapply(), and then finally using rlang::!!! to pass that to dplyr (which is designed to expect such). wrapr doesn't have an equivalent to !!!, so it won't be able to substitute for that last step of your solution.
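
For reference, here is a minimal sketch (not from the original question, purely illustrative) of the kind of single-name substitution wrapr::let() does cover; what it does not offer is a way to splice a variable-length list of expressions into ..., which is exactly what !!! provides.

library("wrapr")
library("dplyr")

d <- data.frame(hp = c(100, 150))

# the alias mapping is a character vector of symbol-name -> replacement-name
let(
  c(COLNAME = "hp", NEWCOL = "hp_doubled"),
  d %>% mutate(NEWCOL = COLNAME * 2)
)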

A more database-idiomatic way to deal with this is to treat the cut boundaries as data, rather than trying to translate them into code. That would look something like the following.

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

db <- DBI::dbConnect(RSQLite::SQLite(), ':memory:')
mtcars <- mtcars %>% 
  mutate(., id = seq_len(nrow(.)))
DBI::dbWriteTable(db, 'tbl_mtcars',  mtcars,
                  temporary = TRUE, overwrite = TRUE)
tbl_mtcars <- dplyr::tbl(db, 'tbl_mtcars')

cuts <- data.frame(cut = c(100,200,300))
DBI::dbWriteTable(db, 'tbl_cuts',  cuts,
                  temporary = TRUE, overwrite = TRUE)
tbl_cuts <- dplyr::tbl(db, 'tbl_cuts')

bin_choices <- tbl_mtcars %>%
  select(., id, hp) %>%
  left_join(., tbl_cuts, by = character(0)) %>%
  filter(., cut >= hp) %>%
  group_by(., id) %>%
  summarise(., hp_cut = min(cut, na.rm = TRUE))

left_join(tbl_mtcars, bin_choices, by = "id") %>%
  head(.)
#> # Source:   lazy query [?? x 13]
#> # Database: sqlite 3.22.0 [:memory:]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    id
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4     1
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4     2
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1     3
#> 4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1     4
#> 5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2     5
#> 6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1     6
#> # … with 1 more variable: hp_cut <dbl>

The derived column (the whole point of this calculation) was lost to tibble's formatting.
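
(To actually display it, one could follow up with a query that selects the new column explicitly; this is just an illustrative addition using the objects defined above, not part of the original output.)

left_join(tbl_mtcars, bin_choices, by = "id") %>%
  select(., id, hp, hp_cut) %>%
  head(.)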

Unfortunately such a strategy will not work with dplyr on in-memory data.frames, as the dplyr in-memory data.frame join is deficient (it refuses a join by zero columns).

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

mtcars <- mtcars %>% 
  mutate(., id = seq_len(nrow(.)))

cuts <- data.frame(cut = c(100,200,300))

bin_choices <- mtcars %>%
  select(., id, hp) %>%
  left_join(., cuts, by = character(0)) %>%
  filter(., cut >= hp) %>%
  group_by(., id) %>%
  summarise(., hp_cut = min(cut, na.rm = TRUE))
#> Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches): `by` must specify variables to join by

We could work around the above by introducing a column that is constant in each table and making these constant columns the join-by condition.
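
A rough sketch of that workaround (the constant column name k_dummy is arbitrary) might look like the following.

library("dplyr")

mtcars2 <- mtcars %>% 
  mutate(., id = seq_len(nrow(.)), k_dummy = 1)
cuts <- data.frame(cut = c(100, 200, 300), k_dummy = 1)

bin_choices <- mtcars2 %>%
  select(., id, hp, k_dummy) %>%
  left_join(., cuts, by = "k_dummy") %>%   # emulates the cross join
  filter(., cut >= hp) %>%
  group_by(., id) %>%
  summarise(., hp_cut = min(cut, na.rm = TRUE))

head(left_join(mtcars2, bin_choices, by = "id"))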

rquery can implement the original pattern directly and reliably.

library("rquery")

raw_connection <- DBI::dbConnect(RSQLite::SQLite(), ':memory:')
dbopts <- rq_connection_tests(raw_connection)
db <- rquery_db_info(connection = raw_connection,
                     is_dbi = TRUE,
                     connection_options = dbopts)

mtcars <- mtcars
mtcars$id <- seq_len(nrow(mtcars))
db_mtcars <- rq_copy_to(db, "db_mtcars", mtcars,
                        temporary = TRUE, overwrite = TRUE)

cuts <- data.frame(cut = c(100,200,300))
db_cuts <- rq_copy_to(db, "db_cuts", cuts,
                      temporary = TRUE, overwrite = TRUE)

bin_choices <- db_mtcars %.>%
  select_columns(., c("id", "hp")) %.>%
  natural_join(., db_cuts, 
               jointype = "LEFT",
               by = character(0)) %.>%
  select_rows(., cut >= hp) %.>%
  project(., groupby = "id", 
          hp_cut = min(cut))

annotation_ops <- natural_join(db_mtcars, bin_choices, 
                               jointype = "LEFT",
                               by = "id")

execute(db, annotation_ops) %.>%
  head(.)
#>   id  mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_cut
#> 1  1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    200
#> 2  2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    200
#> 3  3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    100
#> 4  4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    200
#> 5  5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2    200
#> 6  6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    200

And with rqdatatable the same code can be used on in-memory data.frames.

library("rqdatatable")

ex_data_table(annotation_ops, 
              tables = list(
                db_mtcars = mtcars,
                db_cuts = cuts)) %.>%
  head(.)
#>    id am carb cyl disp drat gear  hp  mpg  qsec vs    wt hp_cut
#> 1:  1  1    4   6  160 3.90    4 110 21.0 16.46  0 2.620    200
#> 2:  2  1    4   6  160 3.90    4 110 21.0 17.02  0 2.875    200
#> 3:  3  1    1   4  108 3.85    4  93 22.8 18.61  1 2.320    100
#> 4:  4  0    1   6  258 3.08    3 110 21.4 19.44  1 3.215    200
#> 5:  5  0    2   8  360 3.15    3 175 18.7 17.02  0 3.440    200
#> 6:  6  0    1   6  225 2.76    3 105 18.1 20.22  1 3.460    200

rkingdc commented 5 years ago

@JohnMount Thanks for your incredibly thorough answer! I'll investigate and benchmark your idea of treating cuts as data--although this specific use case is a Shiny app that gets templated based on an underlying database. With an unknown number of columns and tables whose cutpoints are initialized at app startup, plus giving the user the ability to set them interactively, that approach might become untenable for the app.

I really admire the design approach in your packages--keep up the fantastic work!

JohnMount commented 5 years ago

Wow, thanks for the nice comments.

The benchmarking issue will be: the cross-join (a join with no meaningful conditions) creates a lot of rows (which is why I narrow the table first and then join back in later). Building up a big expression is a legitimate strategy (I use it in the cdata package), so it will be interesting to compare.
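
As a rough worked example (sizes chosen for illustration, not from the thread): a keyless join of nrows rows against ncuts cut points materializes nrows * ncuts intermediate rows before the filter runs.

nrows <- 10000
ncuts <- 1000
nrows * ncuts   # 10,000,000 intermediate rows to build and then filter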

JohnMount commented 5 years ago

Forgot to say. Benchmarking would be interesting. If you end up doing it drop me a line and we will link to it (regardless of what comes out on top).

rkingdc commented 5 years ago

I'm working on benchmarks--but I want to make sure my rquery code is optimized. Here's the function I'm using:

db_rquery_join_fn <- function(data, cut_vector, db=db, column_to_cut = 'column_to_cut'){
    dbopts <- rquery::rq_connection_tests(db)
    rdb <- rquery::rquery_db_info(connection = db,
                                  is_dbi = TRUE,
                                  connection_options = dbopts)

    d <- data.frame(cut = cut_vector)
    db_cuts <- rq_copy_to(db, "db_cuts", d,
                          temporary = TRUE, overwrite = TRUE)

    bin_choices <- wrapr::let(c(COLTOCUT=column_to_cut), {
        data %.>%
        rquery::select_columns(., c("id", column_to_cut)) %.>%
            rquery::natural_join(., db_cuts, 
                     jointype = "LEFT",
                     by = character(0)) %.>%
            rquery::select_rows(., cut >= COLTOCUT) %.>%
            rquery::project(., groupby = "id", 
                cut_ = min(cut))
    })
    annotation_ops <- natural_join(data, bin_choices, 
                                   jointype = "LEFT",
                                   by = "id")
    execute(rdb, annotation_ops)

}

Anything I'm missing to make it faster?

Benchmarking code here:

library(dplyr)
library(rquery)
library(DBI)
library(microbenchmark)
make_data <- function(nrows, ncols=1){
    m <- matrix(sample.int(10000, nrows*ncols, replace=TRUE), nrow=nrows, ncol=ncols)
    df <- data.frame(m)
    df$id <- 1:nrows
    names(df)[1] <- "column_to_cut"
    df
}

NROWS <- 10000
v <- c(20,50,80)

db <- dbConnect(RSQLite::SQLite(), ":memory:")

d <- make_data(NROWS)
dbWriteTable(db, 'data', d, overwrite=TRUE, temporary=TRUE)
dplyr_data <- tbl(db, 'data')
rq_data <- db_td(db, 'data')

mb <- microbenchmark(
    rlang=db_rlang_case_fn(data = dplyr_data, cut_vector = v),
    dplyrjoin=db_dplyr_join_fn(data = dplyr_data, cut_vector = v, db=db),
    rqueryjoin=db_rquery_join_fn(data = rq_data, cut_vector=v, db=db),
    times=100
)

As it is, rquery looks a little slower in this test. It'll be interesting to compare with more columns, more breaks, etc.

Unit: microseconds
       expr        min         lq       mean     median         uq       max neval
      rlang    951.376   1098.061   1579.812   1165.987   1243.438  18662.43   100
  dplyrjoin 356601.957 364861.487 375862.924 372508.525 382145.323 482375.61   100
 rqueryjoin 470101.583 483889.710 495341.638 493285.943 505781.688 541226.89   100

JohnMount commented 5 years ago

Well, I think for the database example rquery should not be timed on the connection tests (they are used just once to characterize the database and build work-arounds for database issues) or on copying the table to/from the database (usually one is working with a DB because that is where the data already lives). The final line of a pure-DB version would be materialize(), not execute().

Also the intent of rquery is to build up the execution plan once and then re-use it, so that is a possible optimization. The exact same query can be used locally or on a database (as long as the table names match what is expected).
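
A minimal sketch of that build-once/re-use pattern (the table and column names here are hypothetical, not from the benchmark code): describe the tables, build the operator tree, then point it at either a database or in-memory data.

library("rquery")
library("rqdatatable")

d_handle <- mk_td("d", c("id", "column_to_cut"))  # a table description only
ops <- d_handle %.>%
  select_rows(., column_to_cut > 50)

# against a database (assuming rdb is an rquery_db_info handle and a
# table named "d" exists there):
#   materialize(rdb, ops)

# in memory, via rqdatatable:
ex_data_table(ops,
              tables = list(d = data.frame(id = 1:3,
                                           column_to_cut = c(10, 60, 90))))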

And RSQLite may not be a good database for testing join timings; a PostgreSQL timing would also be interesting.

A pure in-R timing would be given by using rqdatatable. Are the rlang timings computed over a database as well? I could see a big case statement being fast, but I was wondering what your rlang test was.

library(rqdatatable)
#> Loading required package: rquery
library(microbenchmark)

make_data <- function(nrows, ncols=1){
  m <- matrix(sample.int(10000, nrows*ncols, replace=TRUE), nrow=nrows, ncol=ncols)
  df <- data.frame(m)
  df$id <- 1:nrows
  names(df)[1] <- "column_to_cut"
  df
}

NROWS <- 10000
v <- c(20,50,80)
d <- make_data(NROWS)

d_cuts <- data.frame(cut = v)
col_to_cut <- "column_to_cut"

let(c(COL_TO_CUT = col_to_cut),
    bin_choices <- local_td(d) %.>%
      rquery::select_columns(., c("id", col_to_cut)) %.>%
      rquery::natural_join(., local_td(d_cuts), 
                           jointype = "LEFT",
                           by = character(0)) %.>%
      rquery::select_rows(., cut >= COL_TO_CUT) %.>%
      rquery::project(., groupby = "id", 
                      cut_ = min(cut)))
annotation_ops <- natural_join(local_td(d), bin_choices, 
                               jointype = "LEFT",
                               by = "id")

mb <- microbenchmark(
  rqdatatable = ex_data_table(annotation_ops, 
                              tables = list(d_cuts = d_cuts, d = d)),
  times=100
)
mb
#> Unit: milliseconds
#>         expr      min       lq     mean   median       uq      max neval
#>  rqdatatable 12.89655 15.14009 18.42598 17.39112 19.35595 53.88141   100

Created on 2019-02-27 by the reprex package (v0.2.1)

That is averaging out to about 15140 microseconds, so about 10 times slower than the rlang timings you saw (though we are on different machines and all).

JohnMount commented 5 years ago

Also there is a way to write the code without let.

library(rqdatatable)
#> Loading required package: rquery
library(microbenchmark)

make_data <- function(nrows, ncols=1){
  m <- matrix(sample.int(10000, nrows*ncols, replace=TRUE), nrow=nrows, ncol=ncols)
  df <- data.frame(m)
  df$id <- 1:nrows
  names(df)[1] <- "column_to_cut"
  df
}

NROWS <- 10000
v <- c(20,50,80)
d <- make_data(NROWS)

d_cuts <- data.frame(cut = v)
col_to_cut <- as.name("column_to_cut")

bin_choices <- local_td(d) %.>%
  rquery::select_columns(., c("id", as.character(col_to_cut))) %.>%
  rquery::natural_join(., local_td(d_cuts), 
                       jointype = "LEFT",
                       by = character(0)) %.>%
  rquery::select_rows_se(., qe(cut >= .(col_to_cut))) %.>%
  rquery::project(., groupby = "id", 
                  cut_ = min(cut))
annotation_ops <- natural_join(local_td(d), bin_choices, 
                               jointype = "LEFT",
                               by = "id")

mb <- microbenchmark(
  rqdatatable = ex_data_table(annotation_ops, 
                              tables = list(d_cuts = d_cuts, d = d)),
  times=100
)
mb
#> Unit: milliseconds
#>         expr      min       lq    mean  median       uq      max neval
#>  rqdatatable 12.87766 13.46052 15.5771 14.2622 16.06964 32.41954   100

rkingdc commented 5 years ago

OK! I refactored the rquery function based on your feedback. I'll include the whole script below so you can peruse it, but the results are really interesting! I did my tests on a local PostgreSQL server (10.6).

When the number of cut points is low (3), the rquery join is fastest, followed by dplyr::case_when (labeled rlang in microbenchmark) and then the dplyr join.

### data with 100 rows
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
      rlang 37.55146 39.52894 44.41826 43.83533 47.61188  58.32659   100
  dplyrjoin 61.23415 65.91865 70.40670 69.39216 74.54053  95.42232   100
 rqueryjoin 28.01681 29.97790 34.36187 32.55364 36.83064 128.60792   100

### data with 10,000 rows
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
      rlang 57.27134 58.72057 65.18727 62.27854 70.42094  86.83360    30
  dplyrjoin 85.23575 88.82146 93.12072 91.91743 95.58685 114.52208    30
 rqueryjoin 48.67872 51.13598 55.95044 54.01054 58.74952  71.04822    30

When the number of cuts increases, the performance of case_when degrades; the rquery performance is best, followed closely by dplyr.

### 1,000 rows, 100 cuts
Unit: milliseconds
       expr       min        lq      mean    median        uq      max neval
      rlang 475.43697 492.51128 509.82625 501.19758 529.05239 601.3024   100
  dplyrjoin  85.86022  90.68527  97.16105  95.60870 100.96830 200.1967   100
 rqueryjoin  51.22859  53.52795  57.74979  55.53262  59.34091 111.6685   100

I also wanted to see how adding an index to the column that gets cut affects things. It ended up not really making a difference.

### 10,000 rows, 1,000 cuts
Unit: seconds
       expr      min       lq     mean   median       uq      max neval
      rlang 5.860360 5.905089 6.022183 5.972785 6.072321 6.655203    30
  dplyrjoin 2.369694 2.385239 2.438868 2.392000 2.500667 2.661496    30
 rqueryjoin 2.340883 2.356892 2.417487 2.380193 2.436352 2.878821    30

library(dplyr)
library(rquery)
library(DBI)
library(RMariaDB)
library(microbenchmark)

.make_rlang_case_fn <- function(column_name, cut_vector){
    # get names in various formats 
    s_column_name     <- rlang::sym(column_name)

    # the vector shouldn't have names, but if it has them, use those names instead of the
    # canned ones then NULL out the names
    if (!is.null(names(cut_vector))){
        cut_names <- names(cut_vector)
        cut_vector <- unname(cut_vector)
    } else {
        cut_names <- cut_vector
    }

    # construct the object case_when needs to work 
    case_expr <- lapply(c(0, seq_along(cut_vector)), function(i){
        if (i == 0){
            lab <- sprintf('%s', cut_names[i+1]) # a label
            rlang::expr(!!s_column_name <= cut_vector[!!i+1] ~ !!lab) # the expression
        } else if (i == length(cut_vector)) {
            lab <- sprintf('>max', cut_names[i])
            rlang::expr(!!s_column_name > cut_vector[!!i] ~ !!lab)
        } else {
            lab <- sprintf('%s', cut_names[i+1])
            rlang::expr(!!s_column_name > cut_vector[!!i] & !!s_column_name <= cut_vector[!!i+1] ~ !!lab)
        }
    })

    # return the function
    return(function(data){
        dplyr::compute(dplyr::mutate(data, cut_ := dplyr::case_when(!!!case_expr)))
    })
}

db_dplyr_join_fn <- function(data, tbl_cuts, column_to_cut="column_to_cut"){
    bin_choices <- data %>%
        select(., id, k_dummy, !!rlang::sym(column_to_cut)) %>%
        left_join(., tbl_cuts, by = 'k_dummy') %>%
        filter(., cut >= !!rlang::sym(column_to_cut)) %>%
        group_by(., id) %>%
        summarise(., cut_ = min(cut, na.rm = TRUE))
    return(dplyr::compute(dplyr::left_join(data, bin_choices, by = 'id')))
}

.make_db_rquery_join_fn <- function(data, tbl_cuts, db=db, column_to_cut = 'column_to_cut'){
    dbopts <- rquery::rq_connection_tests(db)
    rqdb <- rquery::rquery_db_info(connection = db,
                                  is_dbi = TRUE,
                                  connection_options = dbopts)

    bin_choices <- data %.>% 
        rquery::select_columns(., c("id", as.character(column_to_cut), 'k_dummy')) %.>% 
        rquery::natural_join(., tbl_cuts, jointype = "LEFT", by = 'k_dummy') %.>% 
        rquery::select_rows_se(., qe(cut >= column_to_cut)) %.>% 
        rquery::project(., groupby = "id", cut_ = min(cut)) 

    annotation_ops <- natural_join(data, bin_choices, jointype = "LEFT", by = "id")

    # to_sql(annotation_ops, db=rqdb)

    FUN = function(){
        rquery::materialize(db = rqdb, optree = annotation_ops)
    }
    return(FUN)
}

make_data <- function(nrows, ncols=1){
    m <- matrix(sample.int(10000, nrows*ncols, replace=TRUE), nrow=nrows, ncol=ncols)
    df <- data.frame(m)
    df$id <- 1:nrows
    names(df)[1] <- "column_to_cut"
    df$k_dummy <- 1
    df
}

run_benchmark <- function(nrows, ncuts, db, times=100, add_index=FALSE){

    ## setup ##
    d <- make_data(nrows)
    dbWriteTable(db, 'data', d, overwrite=TRUE, temporary=!add_index)

    if(add_index){
        DBI::dbExecute(db, 'CREATE INDEX idx ON data (column_to_cut)')
        DBI::dbExecute(db, 'ANALYZE data')
    }

    dplyr_data <- tbl(db, 'data')
    rq_data <- db_td(db, 'data')

    cut_vector <- seq(from=min(d$column_to_cut), to=max(d$column_to_cut), length.out=ncuts+2)
    cut_vector <- cut_vector[-c(1, length(cut_vector))]

    tmpdat <- data.frame(cut = floor(cut_vector), k_dummy=1)
    DBI::dbWriteTable(db, 
                      'tmp',  tmpdat,
                      temporary=TRUE, overwrite=TRUE)
    dplyr_cuts <- dplyr::tbl(db, 'tmp')
    rq_cuts <- db_td(db, 'tmp')

    # set up the rlang case fn and the rquery join fn
    db_rlang_case_fn <- .make_rlang_case_fn(column_name = 'column_to_cut', cut_vector=cut_vector)
    db_rquery_join_fn <- .make_db_rquery_join_fn(data=rq_data, tbl_cuts=rq_cuts,
                                                 db=db, column_to_cut = 'column_to_cut')    

    mb <- microbenchmark(
        rlang=db_rlang_case_fn(dplyr_data),
        dplyrjoin=db_dplyr_join_fn(data = dplyr_data, tbl_cuts=dplyr_cuts),
        rqueryjoin=db_rquery_join_fn(),
        times=times
    )

    if (add_index){
        DBI::dbExecute(db, 'TRUNCATE TABLE data')
    }
    return(mb)
}

db <- dbConnect(RPostgres::Postgres(), dbname='roz')

pg_bm1e2.3 <- run_benchmark(nrows=100, ncuts=3, db=db, times=30)
pg_bm1e3.3 <- run_benchmark(nrows=1e3, ncuts=3, db=db, times=30)
pg_bm1e4.3 <- run_benchmark(nrows=1e4, ncuts=3, db=db, times=30)

pg_bm1e3.10   <- run_benchmark(nrows=1e3, ncuts=10, db=db, times=30)
pg_bm1e3.100  <- run_benchmark(nrows=1e3, ncuts=100, db=db, times=30)
pg_bm1e3.1000 <- run_benchmark(nrows=1e3, ncuts=1000, db=db, times=30)

pg_bm1e3.100.idx  <- run_benchmark(nrows=1e3, ncuts=100, db=db, times=30, add_index=TRUE)
pg_bm1e4.1000.idx <- run_benchmark(nrows=1e4, ncuts=1000, db=db, times=30, add_index=TRUE)

dbDisconnect(db)

JohnMount commented 5 years ago

This is in fact interesting. I am a little surprised the case-strategy isn't faster. But that is why we run experiments.

I am really hoping you will write this up and share it. I'll link to it. And that doesn't depend on rquery being fastest; earlier I was only commenting on where one could improve what was being measured (not trying to push rquery over the finish line).

I like how you factored out the set-up costs for both rquery and dplyr, that is a good use of both packages.

One abstract worry with DB benchmarks is that the query optimizer may be caching a result and not re-running the query (interfering with timings). But I think PostgreSQL does not do that (https://stackoverflow.com/questions/24252455/how-disable-postgresql-cache-optimization). It looks like source tables eventually get swapped into memory (which is fine), but the query actually does get re-run.

If you need anything from me please reach out. My email is jmount@win-vector.com .

Some points that might be interesting to touch on (beyond timings): the case-structure is in fact very nice, and rquery does not yet have such a thing. The join strategy is perhaps a bit more of a DB-centric way of thinking about things (we pretend that in an ideal database replicating a row 4 times and filtering should cost about the same as applying a 4-clause query to a row; amazing DB implementations sometimes make this the case because they know people use databases that way).

rkingdc commented 5 years ago

Yeah, that was my initial thought as well, about query caching. I wanted to test it in MariaDB/MySQL too, since they do cache queries, but it's been a pain to get it to work (still debugging the dplyr methods--haven't even gotten to the rquery function).

I likely will write this up, and I'll drop you a line when I do. Thanks for your help!

JohnMount commented 5 years ago

I think for rquery::select_rows_se(., qe(cut >= column_to_cut)) to work correctly, column_to_cut must be a name, not a string (as in as.name('column_to_cut')). If the distinction is not made, it is likely comparing to a string constant, not a column (sorry!). The problem is that the .() substitution happens at a different level of code than let() substitution.
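
A tiny base-R illustration of the distinction (using bquote() directly, which is the mechanism that substitution behavior comes from):

col_str  <- "column_to_cut"
col_name <- as.name("column_to_cut")

bquote(cut >= .(col_str))   # cut >= "column_to_cut"  -- a string constant
bquote(cut >= .(col_name))  # cut >= column_to_cut    -- a column reference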

rkingdc commented 5 years ago

In this case, the name of the column to cut is in fact "column_to_cut". I think I made my code too confusing by passing that column name as an argument in some places and hard-coding it in others. In qe() it won't make a substitution if there is no bquote-style enquoting going on, right?

JohnMount commented 5 years ago

It is my fault: rquery is database oriented, so it goes to a lot of trouble to maintain the difference between using names to represent column names and strings to represent string constants. bquote() substitution happens in the syntax tree, where that difference is preserved (the qe() is going through a different path); let() substitution happens closer to the source-code level, where the distinction is not maintained.

So I think for all the bquote() situations (including qe()) the variable must be of type name, not a string, for the filter condition to be correct. I say "think" as the behavior comes from bquote(), not from my own code. And yes, without the ".()" marks, no bquote-style substitution should occur.

So I would definitely check that all the queries return equivalent results. I don't think the rquery one will be correct until one passes in column_to_cut as column_to_cut=as.name("column_to_cut").

Vignette on substitution in rquery.

JohnMount commented 5 years ago

Also, you have an interesting article with just the dplyr-to-dplyr comparisons if need be. I am trying to promote rquery and wrapr, but I don't want to slow down another good idea.

rkingdc commented 5 years ago

No need to shy away from a little promotion! I'm including rquery in my post because it's a really cool tool and I'd love to see more adoption there as well. I find the tidyverse/rlang approach poorly thought out--they should have taken more notes from data.table and bquote--so I'd love to see rquery and wrapr exposing those issues and forcing better thought about NSE.

Plus, as we see, rquery is faster in this instance, and that's useful! Thanks again for the help

rkingdc commented 5 years ago

Here's the post: https://www.rkingdc.com/blog/2019/2/28/binning-columns-in-remote-tables-with-dplyr-and-rquery

JohnMount commented 5 years ago

Neat, I'll call it out now. Thanks!

rkingdc commented 5 years ago

Thanks!