Closed rkingdc closed 5 years ago
Thanks for the note, I am glad you like wrapr::let()
.
The core point is wrapr::let()
and rlang::!!
are just tools for working around APIs that capture values by NSE instead of taking a value. rlang::!!!
works around the additional API flaw of taking a series of values in ...
instead of accepting a list.
Building up a large expression is one of the few places where the above issues occur naturally. Most of the other places one has such issue could have been avoided with better design. Our advice is design package APIs to minimize the above flaws. Though introducing such flaws and then being seen to fix them somehow is portrayed as exciting progress. case_when()
is dplyr
's baby, so only packages maintained by the same group will be able to reliably lash into its internals.
The concrete issue with your cut example is: your solution is building up an expression of undetermined size from pieces in an lapply()
, and then finally using rlang:!!!
to pass that to dplyr
(which is designed to expect such). wrapr
doesn't have a equivalent to !!!
, so it won't be able to substitute for the last step of your solution.
A more database idiomatic way to deal with this is to treat the cut boundaries as data, and not try to translate them into code. That would look something like the following.
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
db <- DBI::dbConnect(RSQLite::SQLite(), ':memory:')
mtcars <- mtcars %>%
mutate(., id = seq_len(nrow(.)))
DBI::dbWriteTable(db, 'tbl_mtcars', mtcars,
temporary = TRUE, overwrite = TRUE)
tbl_mtcars <- dplyr::tbl(db, 'tbl_mtcars')
cuts <- data.frame(cut = c(100,200,300))
DBI::dbWriteTable(db, 'tbl_cuts', cuts,
temporary = TRUE, overwrite = TRUE)
tbl_cuts <- dplyr::tbl(db, 'tbl_cuts')
bin_choices <- tbl_mtcars %>%
select(., id, hp) %>%
left_join(., tbl_cuts, by = character(0)) %>%
filter(., cut >= hp) %>%
group_by(., id) %>%
summarise(., hp_cut = min(cut, na.rm = TRUE))
left_join(tbl_mtcars, bin_choices, by = "id") %>%
head(.)
#> # Source: lazy query [?? x 13]
#> # Database: sqlite 3.22.0 [:memory:]
#> mpg cyl disp hp drat wt qsec vs am gear carb id
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 1
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 2
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 3
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 4
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 5
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 6
#> # … with 1 more variable: hp_cut <dbl>
The derived column (the whole point of this calculation) was lost to tibble
's formatting.
Unfortunately such a strategy will not work for dplyr
for in-memory data.frames as the dplyr
in-memory data.frame join is deficient.
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mtcars <- mtcars %>%
mutate(., id = seq_len(nrow(.)))
cuts <- data.frame(cut = c(100,200,300))
bin_choices <- mtcars %>%
select(., id, hp) %>%
left_join(., cuts, by = character(0)) %>%
filter(., cut >= hp) %>%
group_by(., id) %>%
summarise(., hp_cut = min(cut, na.rm = TRUE))
#> Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches): `by` must specify variables to join by
We could work around the above by introducing a column that is a constant in each table and making these constant columns the join-by conditon.
rquery
can implement the original pattern directly and reliably.
library("rquery")
raw_connection <- DBI::dbConnect(RSQLite::SQLite(), ':memory:')
dbopts <- rq_connection_tests(raw_connection)
db <- rquery_db_info(connection = raw_connection,
is_dbi = TRUE,
connection_options = dbopts)
mtcars <- mtcars
mtcars$id <- seq_len(nrow(mtcars))
db_mtcars <- rq_copy_to(db, "db_mtcars", mtcars,
temporary = TRUE, overwrite = TRUE)
cuts <- data.frame(cut = c(100,200,300))
db_cuts <- rq_copy_to(db, "db_cuts", cuts,
temporary = TRUE, overwrite = TRUE)
bin_choices <- db_mtcars %.>%
select_columns(., c("id", "hp")) %.>%
natural_join(., db_cuts,
jointype = "LEFT",
by = character(0)) %.>%
select_rows(., cut >= hp) %.>%
project(., groupby = "id",
hp_cut = min(cut))
annotation_ops <- natural_join(db_mtcars, bin_choices,
jointype = "LEFT",
by = "id")
execute(db, annotation_ops) %.>%
head(.)
#> id mpg cyl disp hp drat wt qsec vs am gear carb hp_cut
#> 1 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 200
#> 2 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 200
#> 3 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 100
#> 4 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 200
#> 5 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 200
#> 6 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 200
And with rqdatatable
the same code can be used on in-memory data.frames.
library("rqdatatable")
ex_data_table(annotation_ops,
tables = list(
db_mtcars = mtcars,
db_cuts = cuts)) %.>%
head(.)
#> id am carb cyl disp drat gear hp mpg qsec vs wt hp_cut
#> 1: 1 1 4 6 160 3.90 4 110 21.0 16.46 0 2.620 200
#> 2: 2 1 4 6 160 3.90 4 110 21.0 17.02 0 2.875 200
#> 3: 3 1 1 4 108 3.85 4 93 22.8 18.61 1 2.320 100
#> 4: 4 0 1 6 258 3.08 3 110 21.4 19.44 1 3.215 200
#> 5: 5 0 2 8 360 3.15 3 175 18.7 17.02 0 3.440 200
#> 6: 6 0 1 6 225 2.76 3 105 18.1 20.22 1 3.460 200
@JohnMount Thanks for your incredibly thorough answer! I'll investigate and benchmark your idea for treating cuts as data--although this specific use case is a shiny app that gets templated based on an underlying database. With an unknown number of columns and tables that will have cutpoints initialized at app startup plus giving the user the ability to interactively set them might become untenable for the app.
I really admire the design approach in your packages--keep up the fantastic work!
Wow, thanks for the nice comments.
The benchmarking issue will be: the cross-join (join with no meaningful conditions) creates a lot of rows (which is why I narrow the table and then join back in later). Building up a big expression is a legit strategy (I use it in the cdata
package), so it will be interesting to compare.
Forgot to say. Benchmarking would be interesting. If you end up doing it drop me a line and we will link to it (regardless of what comes out on top).
I'm working on benchamarks--but I wan to make sure my rquery code is optimized. Here's the function I'm using:
db_rquery_join_fn <- function(data, cut_vector, db=db, column_to_cut = 'column_to_cut'){
dbopts <- rquery::rq_connection_tests(db)
rdb <- rquery::rquery_db_info(connection = db,
is_dbi = TRUE,
connection_options = dbopts)
d <- data.frame(cut = cut_vector)
db_cuts <- rq_copy_to(db, "db_cuts", d,
temporary = TRUE, overwrite = TRUE)
bin_choices <- wrapr::let(c(COLTOCUT=column_to_cut), {
data %.>%
rquery::select_columns(., c("id", column_to_cut)) %.>%
rquery::natural_join(., db_cuts,
jointype = "LEFT",
by = character(0)) %.>%
rquery::select_rows(., cut >= COLTOCUT) %.>%
rquery::project(., groupby = "id",
cut_ = min(cut))
})
annotation_ops <- natural_join(data, bin_choices,
jointype = "LEFT",
by = "id")
execute(rdb, annotation_ops)
}
Anything I'm missing to make it faster?
Benchamarking code here:
library(dplyr)
library(rquery)
library(DBI)
library(microbenchmark)
make_data <- function(nrows, ncols=1){
m <- matrix(sample.int(10000, nrows*ncols, replace=TRUE), nrow=nrows, ncol=ncols)
df <- data.frame(m)
df$id <- 1:nrows
names(df)[1] <- "column_to_cut"
df
}
NROWS <- 10000
v <- c(20,50,80)
db <- dbConnect(RSQLite::SQLite(), ":memory:")
d <- make_data(NROWS)
dbWriteTable(db, 'data', d, overwrite=TRUE, temporary=TRUE)
dplyr_data <- tbl(db, 'data')
rq_data <- db_td(db, 'data')
mb <- microbenchmark(
rlang=db_rlang_case_fn(data = dplyr_data, cut_vector = v),
dplyrjoin=db_dplyr_join_fn(data = dplyr_data, cut_vector = v, db=db),
rqueryjoin=db_rquery_join_fn(data = rq_data, cut_vector=v, db=db),
times=100
)
as it is rquery looks a little slower with this part. It'll be interesting to compare with more columns, more breaks, etc.
Unit: microseconds
expr min lq mean median uq max neval
rlang 951.376 1098.061 1579.812 1165.987 1243.438 18662.43 100
dplyrjoin 356601.957 364861.487 375862.924 372508.525 382145.323 482375.61 100
rqueryjoin 470101.583 483889.710 495341.638 493285.943 505781.688 541226.89 100
Well I think for the database example rquery
should not be timed for the connection tests (they are just used once to characterize the database and build work-arounds for database issues) and copying the table to/from the database (usually one is working with a DB because that is where the database is). The final line on a pure-DB version would be materialize()
not execute()
.
Also the intent of rquery
is to build up the execution plan once and then re-use it, so that is a possible optimization. The exact same query can be used locally or on a database (as long as the table names match what is expected).
And RSQLite
may not be a good database to test join timings, a PostgreSQL
timing would also be interesting.
A pure in-R
timing would be given by using rqdatatable
. Are the rlang
timings computing over a database also? I could see a big case statement being fast, but I was wondering what your rlang
test was.
library(rqdatatable)
#> Loading required package: rquery
library(microbenchmark)
make_data <- function(nrows, ncols=1){
m <- matrix(sample.int(10000, nrows*ncols, replace=TRUE), nrow=nrows, ncol=ncols)
df <- data.frame(m)
df$id <- 1:nrows
names(df)[1] <- "column_to_cut"
df
}
NROWS <- 10000
v <- c(20,50,80)
d <- make_data(NROWS)
d_cuts <- data.frame(cut = v)
col_to_cut <- "column_to_cut"
let(c(COL_TO_CUT = col_to_cut),
bin_choices <- local_td(d) %.>%
rquery::select_columns(., c("id", col_to_cut)) %.>%
rquery::natural_join(., local_td(d_cuts),
jointype = "LEFT",
by = character(0)) %.>%
rquery::select_rows(., cut >= COL_TO_CUT) %.>%
rquery::project(., groupby = "id",
cut_ = min(cut)))
annotation_ops <- natural_join(local_td(d), bin_choices,
jointype = "LEFT",
by = "id")
mb <- microbenchmark(
rqdatatable = ex_data_table(annotation_ops,
tables = list(d_cuts = d_cuts, d = d)),
times=100
)
mb
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> rqdatatable 12.89655 15.14009 18.42598 17.39112 19.35595 53.88141 100
Created on 2019-02-27 by the reprex package (v0.2.1)
That is averaging out to about 15140 microseconds, so about 10 times slower than the rlang
timings you saw (though we are on different machines and all).
Also there is a way to write the code without let.
library(rqdatatable)
#> Loading required package: rquery
library(microbenchmark)
make_data <- function(nrows, ncols=1){
m <- matrix(sample.int(10000, nrows*ncols, replace=TRUE), nrow=nrows, ncol=ncols)
df <- data.frame(m)
df$id <- 1:nrows
names(df)[1] <- "column_to_cut"
df
}
NROWS <- 10000
v <- c(20,50,80)
d <- make_data(NROWS)
d_cuts <- data.frame(cut = v)
col_to_cut <- as.name("column_to_cut")
bin_choices <- local_td(d) %.>%
rquery::select_columns(., c("id", as.character(col_to_cut))) %.>%
rquery::natural_join(., local_td(d_cuts),
jointype = "LEFT",
by = character(0)) %.>%
rquery::select_rows_se(., qe(cut >= .(col_to_cut))) %.>%
rquery::project(., groupby = "id",
cut_ = min(cut))
annotation_ops <- natural_join(local_td(d), bin_choices,
jointype = "LEFT",
by = "id")
mb <- microbenchmark(
rqdatatable = ex_data_table(annotation_ops,
tables = list(d_cuts = d_cuts, d = d)),
times=100
)
mb
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> rqdatatable 12.87766 13.46052 15.5771 14.2622 16.06964 32.41954 100
ok! I refactored the rquery function based on your feedback. I'll include the whole script below so you can pursue it, but the results are really interesting! I did my tests on a local PostgreSQL server (10.6).
When the number of cut points are low (3), rquery join is fastest, followed by dplyr::case_when (labeled rlang in microbenchmark) and then dplyr join.
### data with 100 rows
Unit: milliseconds
expr min lq mean median uq max neval
rlang 37.55146 39.52894 44.41826 43.83533 47.61188 58.32659 100
dplyrjoin 61.23415 65.91865 70.40670 69.39216 74.54053 95.42232 100
rqueryjoin 28.01681 29.97790 34.36187 32.55364 36.83064 128.60792 100
### data with 10,000 rows
Unit: milliseconds
expr min lq mean median uq max neval
rlang 57.27134 58.72057 65.18727 62.27854 70.42094 86.83360 30
dplyrjoin 85.23575 88.82146 93.12072 91.91743 95.58685 114.52208 30
rqueryjoin 48.67872 51.13598 55.95044 54.01054 58.74952 71.04822 30
When the number of cuts increase, the performance of case_when decreases and the rquery perfeocne is best, followed closely by dplyr.
# 1000 rows, 100 cuts
Unit: milliseconds
expr min lq mean median uq max neval
rlang 475.43697 492.51128 509.82625 501.19758 529.05239 601.3024 100
dplyrjoin 85.86022 90.68527 97.16105 95.60870 100.96830 200.1967 100
rqueryjoin 51.22859 53.52795 57.74979 55.53262 59.34091 111.6685 100
I also wanted to see how adding a index to the column that gets cut affects things. It ended up not really making a difference.
10k rows, 1k cuts
Unit: seconds
expr min lq mean median uq max neval
rlang 5.860360 5.905089 6.022183 5.972785 6.072321 6.655203 30
dplyrjoin 2.369694 2.385239 2.438868 2.392000 2.500667 2.661496 30
rqueryjoin 2.340883 2.356892 2.417487 2.380193 2.436352 2.878821 30
library(dplyr)
library(rquery)
library(DBI)
library(RMariaDB)
library(microbenchmark)
.make_rlang_case_fn <- function(column_name, cut_vector){
# get names in various formats
s_column_name <- rlang::sym(column_name)
# the vector shouldn't have names, but if it has them, use those names instead of the
# canned ones then NULL out the names
if (!is.null(names(cut_vector))){
cut_names <- names(cut_vector)
cut_vector <- unname(cut_vector)
} else {
cut_names <- cut_vector
}
# construct the object case_when needs to work
case_expr <- lapply(c(0, seq_along(cut_vector)), function(i){
if (i == 0){
lab <- sprintf('%s', cut_names[i+1]) # a label
rlang::expr(!!s_column_name <= cut_vector[!!i+1] ~ !!lab) # the expression
} else if (i == length(cut_vector)) {
lab <- sprintf('>max', cut_names[i])
rlang::expr(!!s_column_name > cut_vector[!!i] ~ !!lab)
} else {
lab <- sprintf('%s', cut_names[i+1])
rlang::expr(!!s_column_name > cut_vector[!!i] & !!s_column_name <= cut_vector[!!i+1] ~ !!lab)
}
})
# return the function
return(function(data){
dplyr::compute(dplyr::mutate(data, cut_ := dplyr::case_when(!!!case_expr)))
})
}
db_dplyr_join_fn <- function(data, tbl_cuts, column_to_cut="column_to_cut"){
bin_choices <- data %>%
select(., id, k_dummy, !!rlang::sym(column_to_cut)) %>%
left_join(., tbl_cuts, by = 'k_dummy') %>%
filter(., cut >= !!rlang::sym(column_to_cut)) %>%
group_by(., id) %>%
summarise(., cut_ = min(cut, na.rm = TRUE))
return(dplyr::compute(dplyr::left_join(data, bin_choices, by = 'id')))
}
.make_db_rquery_join_fn <- function(data, tbl_cuts, db=db, column_to_cut = 'column_to_cut'){
dbopts <- rquery::rq_connection_tests(db)
rqdb <- rquery::rquery_db_info(connection = db,
is_dbi = TRUE,
connection_options = dbopts)
bin_choices <- data %.>%
rquery::select_columns(., c("id", as.character(column_to_cut), 'k_dummy')) %.>%
rquery::natural_join(., tbl_cuts, jointype = "LEFT", by = 'k_dummy') %.>%
rquery::select_rows_se(., qe(cut >= column_to_cut)) %.>%
rquery::project(., groupby = "id", cut_ = min(cut))
annotation_ops <- natural_join(data, bin_choices, jointype = "LEFT", by = "id")
# to_sql(annotation_ops, db=rqdb)
FUN = function(){
rquery::materialize(db = rqdb, optree = annotation_ops)
}
return(FUN)
}
make_data <- function(nrows, ncols=1){
m <- matrix(sample.int(10000, nrows*ncols, replace=TRUE), nrow=nrows, ncol=ncols)
df <- data.frame(m)
df$id <- 1:nrows
names(df)[1] <- "column_to_cut"
df$k_dummy <- 1
df
}
run_benchmark <- function(nrows, ncuts, db, times=100, add_index=FALSE){
## setup ##
d <- make_data(nrows)
dbWriteTable(db, 'data', d, overwrite=TRUE, temporary=!add_index)
if(add_index){
DBI::dbExecute(db, 'CREATE INDEX idx ON data (column_to_cut)')
DBI::dbExecute(db, 'ANALYZE data')
}
dplyr_data <- tbl(db, 'data')
rq_data <- db_td(db, 'data')
cut_vector <- seq(from=min(d$column_to_cut), to=max(d$column_to_cut), length.out=ncuts+2)
cut_vector <- cut_vector[-c(1, length(cut_vector))]
tmpdat <- data.frame(cut = floor(cut_vector), k_dummy=1)
DBI::dbWriteTable(db,
'tmp', tmpdat,
temporary=TRUE, overwrite=TRUE)
dplyr_cuts <- dplyr::tbl(db, 'tmp')
rq_cuts <- db_td(db, 'tmp')
# set up rland case fn
db_rlang_case_fn <- .make_rlang_case_fn(column_name = 'column_to_cut', cut_vector=cut_vector)
db_rquery_join_fn <- .make_db_rquery_join_fn(data=rq_data, tbl_cuts=rq_cuts,
db=db, column_to_cut = 'column_to_cut')
mb <- microbenchmark(
rlang=db_rlang_case_fn(dplyr_data),
dplyrjoin=db_dplyr_join_fn(data = dplyr_data, tbl_cuts=dplyr_cuts),
rqueryjoin=db_rquery_join_fn(),
times=times
)
if (add_index){
DBI::dbExecute(db, 'TRUNCATE TABLE data')
}
return(mb)
}
db <- dbConnect(RPostgres::Postgres(), dbname='roz')
pg_bm1e2.3 <- run_benchmark(nrows=100, ncuts=3, db=db, times=30)
pg_bm1e3.3 <- run_benchmark(nrows=1e3, ncuts=3, db=db, times=30)
pg_bm1e4.3 <- run_benchmark(nrows=1e4, ncuts=3, db=db, times=30)
pg_bm1e3.10 <- run_benchmark(nrows=1e3, ncuts=10, db=db, times=30)
pg_bm1e3.100 <- run_benchmark(nrows=1e3, ncuts=100, db=db, times=30)
pg_bm1e3.1000 <- run_benchmark(nrows=1e3, ncuts=1000, db=db, times=30)
pg_bm1e3.100.idx <- run_benchmark(nrows=1e3, ncuts=100, db=db, times=30, add_index=TRUE)
pg_bm1e4.1000.idx <- run_benchmark(nrows=1e4, ncuts=1000, db=db, times=30, add_index=TRUE)
dbDisconnect(db)
This is in fact interesting. I am a little surprised the case-strategy isn't faster. But that is why we run experiments.
I am really hoping you will write this up and share it. I'll link to it. And that doesn't depend on rquery
being fastest, earlier I was only commenting on where one code improve what was being measured (not trying to push rquery
over the finish line).
I like how you factored out the set-up costs for both rquery
and dplyr
, that is a good use of both packages.
One abstract worry with DB benchmarks is the query optimizer may be caching a result and not re-running the query (interfering with timings). But I think PostgreSQL does not do that ( https://stackoverflow.com/questions/24252455/how-disable-postgresql-cache-optimization ). It looks like source tables eventually get swapped in (which is fine), but the query actually does get re-run.
If you need anything from me please reach out. My email is jmount@win-vector.com .
Some points that might be interesting to touch on (beyond timings). The case-structure is in fact very nice and rquery
does not yet have such. The join strategy is perhaps a bit more DB-centric way of thinking of things (we pretend in an ideal database that replicating a row 4 times and filtering should cost about the same as applying a 4-clause query to a row; and then amazing DB implementations sometimes make this the case because they know people use databases that way).
Yeah, that was my initial thought as well, about query caching. I wanted to test it in MariaDB/MySQL too, since they do cache queries, but it's been a pain to get it to work (still debugging the dplyr methods--haven't even gotten to the rquery function).
I likely will write this up, and I'll drop you a line when I do. Thanks for your help!
I think for rquery::select_rows_se(., qe(cut >= column_to_cut))
to work correctly column_to_cut
must be a name, not a string (as in as.name('column_to_cut')). Likely it is comparing to a string constant, not a column if the distinction is not made (sorry!). The problem is the .()
substitution happens at a different level of code than let()
substitution.
In this case, the name of the column to cut is in fact "column_to_cut". I think I made my code too confusing by adding that column name as an argument in some places and hard coded in others. In qe--it won't make a substitution if there is not bquote
style enquoting going on, right?
It is my fault- rquery
is database oriented so it goes to a lot of trouble to maintain the difference between using names to represent columns names and strings to represent string constants. bquote()
substitution happens in the syntax tree, where that difference is preserved (the qe()
is going through a different path), let()
substitution happens closer to the source code level where the distinction is not maintained.
So I think for all the bquote()
situations (including qe()
) the variable must be of type name
, not a string for filter condition to be correct. I say "think" as bquote()
behavior is comes from bquote()
, not from my own code .And yes without the ".()
" symbols, no bquote
style substitution should occur.
So I would definitely check all the queries are return equivalent results. I don't think the rquery
one will be correct until one passes in column_to_cut
as column_to_cut=as.name("column_to_cut")
.
Also, you have an interesting article with just the dplyr
to dplyr
comparisons if need be. I am trying to promote rquery
and wrapr
, but I don't want to slow down another good idea.
No need to shy away from a little promotion! I'm including rquery in my post because it's a really cool tool and I'd love to see more adoption there as well. I find the tidyverse/rlang approach poorly thought out--they should have taken more notes from data.table and bquote--so I'd love to see rquery and wrapr exposing those issues and forcing better thought about NSE.
Plus, as we see, rquery is faster in this instance, and that's useful! Thanks again for the help
Neat, I'll call it out now. Thanks!
Thanks!
I want to be able use
dplyr::case_when
to dynamically cut a database column similar to howbase::cut
might work. I can generate a function to do this withrlang
(below), I findwrapr:let
much more readable. How would this similar approach be done withlet
? Both the construction of thecase_expr
list and passing that list as the argument tocase_when
?Apologies for the strange approach to generating a function that operates on the whole table--it makes sense in the context of the project.