davidagold opened 8 years ago
What about this going into the abstract tables interface instead? Then Jplyr can just call those generic methods. That would be more consistent with dataframes, and DBs with specific SQL dialects could just implement the interface themselves.
Though it seems to require a DAG/lazy evaluation, which brings us back to this package.
What about this going into the abstract tables interface instead?
Let's worry about that stuff later. There's so much work to be done here on just representing things with a DAG. Once that's done, it will be easier to address extensions.
Though I do think it is worth keeping the organization of AbstractTables vis-à-vis this package in the back of our minds. We may want to represent the notion of a tbl_lazy in the type hierarchy, and that will require some figuring out.
@datnamer I'm not quite sure I understand the suggestion, and I'd like to understand it even though I agree it probably won't be relevant until some time later.
@davidagold - You pretty much got it there.
Following up on Part 2, to answer:
> class(flights_sqlite)
[1] "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl"
(TODO: understand how flights_sqlite comes to have this class structure.)
by tracing the sequence of calls from
> my_db <- src_sqlite("my_db.sqlite3", create = TRUE)
> library(nycflights13)
> flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum"))
Examining src_sqlite:
src_sqlite <- function(path, create = FALSE) {
if (!requireNamespace("RSQLite", quietly = TRUE)) {
stop("RSQLite package required to connect to sqlite db", call. = FALSE)
}
if (!create && !file.exists(path)) {
stop("Path does not exist and create = FALSE", call. = FALSE)
}
con <- DBI::dbConnect(RSQLite::SQLite(), path)
RSQLite::initExtension(con)
src_sql("sqlite", con, path = path)
}
We see that my_db <- src_sqlite("my_db.sqlite3", create = TRUE) initiates a connection to the database before creating a src_sql object:
src_sql <- function(subclass, con, ...) {
subclass <- paste0("src_", subclass)
structure(list(con = con, ...), class = c(subclass, "src_sql", "src"))
}
under the subclass src_sqlite:
> str(my_db)
List of 2
$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
.. ..@ Id :<externalptr>
.. ..@ dbname : chr "my_db.sqlite3"
.. ..@ loadable.extensions: logi TRUE
.. ..@ flags : int 6
.. ..@ vfs : chr ""
$ path: chr "my_db.sqlite3"
- attr(*, "class")= chr [1:3] "src_sqlite" "src_sql" "src"
We now trace the sequence of calls in
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum"))
We begin with
copy_to.src_sql <- function(dest, df, name = deparse(substitute(df)),
types = NULL, temporary = TRUE,
unique_indexes = NULL, indexes = NULL,
analyze = TRUE, ...) {
assert_that(is.data.frame(df), is.string(name), is.flag(temporary))
class(df) <- "data.frame" # avoid S4 dispatch problem in dbSendPreparedQuery
if (isTRUE(db_has_table(dest$con, name))) {
stop("Table ", name, " already exists.", call. = FALSE)
}
types <- types %||% db_data_type(dest$con, df)
names(types) <- names(df)
con <- dest$con
db_begin(con)
on.exit(db_rollback(con))
db_create_table(con, name, types, temporary = temporary)
db_insert_into(con, name, df)
db_create_indexes(con, name, unique_indexes, unique = TRUE)
db_create_indexes(con, name, indexes, unique = FALSE)
if (analyze) db_analyze(con, name)
db_commit(con)
on.exit(NULL)
tbl(dest, name)
}
Brief aside: deparse(substitute(·)) does the expected:

> name <- deparse(substitute(flights))
> name
[1] "flights"
Here's the cascade of calls from tbl(dest, name):
tbl.src_sqlite <- function(src, from, ...) {
tbl_sql("sqlite", src = src, from = from, ...)
}
tbl_sql <- function(subclass, src, from, ..., vars = attr(from, "vars")) {
make_tbl(
c(subclass, "sql", "lazy"),
src = src,
ops = op_base_remote(src, from, vars)
)
}
make_tbl <- function(subclass, ...) {
subclass <- paste0("tbl_", subclass)
structure(list(...), class = c(subclass, "tbl"))
}
which explains the hierarchy of classes "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl".
Let's wrap up the exploration with an examination of op_base_remote:
op_base_remote <- function(src, x, vars = NULL) {
# If not literal sql, must be a table identifier
if (!is.sql(x)) {
x <- ident(x)
}
if (is.null(vars)) {
vars <- db_query_fields(src$con, x)
}
op_base("remote", src, x, vars)
}
where db_query_fields() should return a character vector giving the field names of a query, and op_base is:

op_base <- function(name, src, x, vars) {
stopifnot(is.character(vars))
structure(
list(
src = src,
x = x,
vars = vars
),
class = c(paste0("op_base_", name), "op_base", "op")
)
}
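Given these constructors, the ops field of a freshly copied table should be a single op_base_remote node. A quick sanity check one might run (a sketch based on the constructors above, not output pasted from a session):

# Per op_base(), the class vector of the initial node should be
# c("op_base_remote", "op_base", "op").
class(flights_sqlite$ops)

# Its vars field holds the column names obtained via db_query_fields().
flights_sqlite$ops$vars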
Going forward, I think we should maybe do one or two more case studies on sql_build (which is generic over the lazy operations, lazy_ops, and generates an S3 object that represents the query) and sql_render (which takes a query object and then calls a function that is generic over the database). For example, sql_build.op_mutate generates a select_query, and sql_render.select_query calls sql_select, which has different methods for different databases. I have been accumulating some notes, and might be able to free up some time to help work on a basic DAG -> SQL query string.
But before then, I think the final point of this exercise might be to decide whether to build up an (intermediate) "query object", that goes like

DAG --- sql_build() ---> "query object" --- sql_render() ---> SQL string

and the factors that go into such a decision.
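To make that decision concrete, here is a minimal sketch of what such an intermediate query object could look like, echoing the select_query mentioned above (hypothetical names, written in R only for illustration; this is neither dplyr's internals nor a proposal for this package's final API). The idea is a plain record of SELECT/FROM/WHERE pieces produced by walking the DAG, which a separate, dialect-specific step renders into a string:

# Hypothetical intermediate representation: a bag of query components.
select_query <- function(from, select = "*", where = character()) {
  structure(
    list(from = from, select = select, where = where),
    class = "select_query"
  )
}

# Hypothetical renderer: dialect-specific methods would override this step.
render_sql <- function(query) {
  sql <- paste0(
    "SELECT ", paste(query$select, collapse = ", "),
    " FROM ", query$from
  )
  if (length(query$where) > 0) {
    sql <- paste0(sql, " WHERE ", paste(query$where, collapse = " AND "))
  }
  sql
}

# A DAG walker might emit:
q <- select_query("flights", where = c("year = 2013", "month = 1", "day = 1"))
render_sql(q)
# "SELECT * FROM flights WHERE year = 2013 AND month = 1 AND day = 1"

The appeal of the split is that inspection and simplification happen on the query object, while dialect quirks live entirely in the render step.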
Maybe it would be easiest to contact hadley himself and have a chat? I bet he would be willing to give some advice. If that ever happens, please record, because that would be a fascinating discussion to watch.
The purpose of this issue is to provide a description of the internals of dplyr's SQL generation. I intend it to serve as reference/inspiration for the development of such functionality in this package.
We first create a local SQLite3 database according to the vignette:
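For reference, the setup looks roughly like this (the same call traced elsewhere in this thread):

library(dplyr)
library(nycflights13)

my_db <- src_sqlite("my_db.sqlite3", create = TRUE)
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE,
  indexes = list(c("year", "month", "day"), "carrier", "tailnum"))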
(We are not concerned with the generation of indices at this point.)
It is worth examining the structure of the flights_sqlite object. We will see that the data manipulation commands provided by dplyr dispatch on the tbl_lazy class in flights_sqlite's class structure and are accumulated in its ops field.

For instance, let's trace the cascade of functions resulting from a call like c1 <- filter(flights_sqlite, year == 2013, month == 1, day == 1). The adventure starts in R/manip.R.
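At the time of writing, the filter verb is essentially a thin wrapper that captures its arguments with lazyeval and hands them to the filter_ generic; roughly (a reconstruction, not verbatim):

filter <- function(.data, ...) {
  filter_(.data, .dots = lazyeval::lazy_dots(...))
}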
The filtering expressions (in this case year == 2013, month == 1, day == 1) are passed to lazyeval, which produces a lazy_dots object.
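Something along these lines happens under the hood (an illustrative sketch, not the exact internal code path):

# Capture the filtering expressions without evaluating them.
dots <- lazyeval::lazy_dots(year == 2013, month == 1, day == 1)

# `dots` is a list of "lazy" objects, each pairing an unevaluated
# expression (e.g. year == 2013) with the environment it was captured in.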
This object encodes the information (i.e. the un-evaluated filtering expressions) passed to the original filter command. The original data source, i.e. flights_sqlite, and .dots are then passed to the standard-evaluation filter_ generic. To find the specific method called on (flights_sqlite, .dots), we need to identify the class structure of flights_sqlite.
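Namely (as also shown earlier in this thread):

> class(flights_sqlite)
[1] "tbl_sqlite" "tbl_sql" "tbl_lazy" "tbl"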
(TODO: understand how flights_sqlite comes to have this class structure.) The relevant class, i.e. the one on which filter_ dispatches, turns out to be tbl_lazy, and so we examine the relevant method given in R/tbl_lazy.R.
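The method in question looks roughly like this (reconstructed from the dplyr source being described, not verbatim):

filter_.tbl_lazy <- function(.data, ..., .dots) {
  dots <- lazyeval::all_dots(.dots, ...)
  add_op_single("filter", .data, dots = dots)
}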
Since the ... argument is empty, the dots object returned by the lazyeval::all_dots call is "equivalent" (i.e., has the same structure and components) to the original .dots argument.

The purpose of filter_.tbl_lazy is ultimately to collect the information about the filtering expressions and store it in an op object, which is in turn stored in the ops field of the original tbl_lazy data source, i.e. flights_sqlite. This takes us to the implementation of add_op_single in the land of R/lazy_ops.R.
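Roughly (again a reconstruction, not verbatim):

add_op_single <- function(name, .data, dots = list(), args = list()) {
  # Wrap the existing ops graph in a new single-table operation node.
  .data$ops <- op_single(name, x = .data$ops, dots = dots, args = args)
  .data
}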
(We note that the ops field of flights_sqlite is initiated with the op_base_remote method.)

op_single in turn returns an op object whose most specific class is determined by the manipulation command from which the call to add_op_single originated.
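Reconstructed roughly (not verbatim), the name argument ("filter" in our case) is pasted into the class vector:

op_single <- function(name, x, dots = list(), args = list()) {
  structure(
    list(name = name, x = x, dots = dots, args = args),
    class = c(paste0("op_", name), "op_single", "op")
  )
}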
Thus we find that add_op_single (the "single" refers to the fact that filter is a single-table verb) modifies the ops field of the .data argument, i.e. flights_sqlite. In particular, since flights_sqlite$ops is passed as the x argument to op_single, the resultant op_filter object "points to" the original op_base_remote object initialized at the creation of flights_sqlite. In this way dplyr lazily accumulates the manipulation commands applied to a remote data source in a graph structure that can then be analyzed, flattened, and translated into SQL.

Indeed, if we examine the ops field of c1, we see that its value is an op_filter object that contains both the original op_base_remote object as well as all the requisite information about the filtering expressions passed to filter above.
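A sketch of that inspection (the class vectors follow from the constructors sketched above; not output pasted from a session):

c1 <- filter(flights_sqlite, year == 2013, month == 1, day == 1)

class(c1$ops)     # c("op_filter", "op_single", "op")
class(c1$ops$x)   # c("op_base_remote", "op_base", "op") -- the original base node
c1$ops$dots       # the captured (un-evaluated) filtering expressions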
To be continued.