Open LTLA opened 1 year ago
Hi @LTLA
Thank you for your insights/suggestions, and my apologies for the delayed response. I will be dedicating my focus to the SQLDataFrame project, with the primary goal of adding support for DuckDB and incorporating your valuable suggestions. This work will be my main priority during December and January.
Best, Qian
Problem
Currently, the SQL data frame does not support all aspects of the `DataFrame` API, which makes it difficult to plug and play into existing workflows. For example, if I were to add a new column, it currently fails. (In fact, the messages suggest that it's trying to save the entire data frame into a new SQLite file, which could be very inefficient for large tables.) The same goes for other operations not currently/correctly supported by `SQLDataFrame`.

Proposed solution
After briefly working on https://github.com/LTLA/ParquetDataFrame, I can start to see some common design patterns for out-of-memory `DataFrame` representations. So far, the most useful one has been to define a column vector class based on DelayedArray, where column access returns a delayed vector instead of realizing the column's contents in memory. This is mildly useful for end users as it defers any realization in limited-memory scenarios. However, its true value lies in providing a coherent response when the user does something to a `ParquetDataFrame` that causes it to no longer be consistent with the underlying Parquet file, e.g., adding a new column, replacing a column's contents, binding it with another DF... In such cases, the `ParquetDataFrame` collapses to a `DFrame` of `ParquetColumnVector` objects, providing users with the full `DFrame` functionality while preserving the file-backed nature of each surviving column.

The same philosophy can be applied to `SQLDataFrame`s. For example, if I want to add a new column that wasn't present in the SQL table, we would collapse the `SQLDataFrame` to a `DFrame` of `SQLColumnVector`s plus the new column. This ensures that the full `DataFrame` API is supported, avoids any realization of data from the SQL file, and avoids any writes to the file system. We could also handle all the other operations, e.g., row/colname setting, `cbind`ing, column mutation, and so on, some of which would manifest as DelayedArray-implemented operations on the `SQLColumnVector` objects.

Implementation
So, on that note, I wrote a prototype of a SQL-based `DelayedArray` vector. There are a couple of rough corners in it, but hopefully you get the main idea.
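A minimal sketch of the idea might look like the following. All class, slot, and function names here (`SQLiteColumnSeed`, `SQLiteColumnVector`) are illustrative assumptions rather than final API; the sketch just builds on the standard DelayedArray seed contract of a `dim()` method plus an `extract_array()` method.

```r
library(DBI)
library(RSQLite)
library(DelayedArray)

# A DelayedArray seed representing one column of one SQLite table.
setClass("SQLiteColumnSeed", slots = c(
    path = "character",    # path to the SQLite file
    table = "character",   # table containing the column
    column = "character",  # name of the column
    length = "integer"     # number of rows, cached at construction
))

SQLiteColumnSeed <- function(path, table, column) {
    con <- dbConnect(SQLite(), path)
    on.exit(dbDisconnect(con))
    n <- dbGetQuery(con, sprintf("SELECT COUNT(*) FROM %s", table))[, 1]
    new("SQLiteColumnSeed", path = path, table = table,
        column = column, length = as.integer(n))
}

# 1-dimensional array semantics.
setMethod("dim", "SQLiteColumnSeed", function(x) x@length)

# Realize (part of) the column on demand. A more careful version would
# push the row selection into the SQL query itself instead of subsetting
# in R after the fact.
setMethod("extract_array", "SQLiteColumnSeed", function(x, index) {
    con <- dbConnect(SQLite(), x@path)
    on.exit(dbDisconnect(con))
    vals <- dbGetQuery(con,
        sprintf("SELECT %s FROM %s", x@column, x@table))[, 1]
    i <- index[[1]]
    if (!is.null(i)) vals <- vals[i]
    array(vals, dim = length(vals))
})

# The user-facing delayed vector is just a DelayedArray on this seed.
SQLiteColumnVector <- function(path, table, column) {
    DelayedArray(SQLiteColumnSeed(path, table, column))
}
```

Row subsetting, arithmetic and so on then come for free from the DelayedArray machinery, with the SQLite file only touched when a block is actually realized.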
Usage
This is as simple as:
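For illustration, here is the shape of that usage with ordinary in-memory seeds standing in for the SQL-backed columns (the stand-in is an assumption; real usage would construct the delayed vectors from a SQLite file):

```r
library(S4Vectors)
library(DelayedArray)

# Stand-ins for SQL-backed column vectors: any 1-D DelayedArray will do.
x <- DelayedArray(array(runif(10), dim = 10L))
y <- DelayedArray(array(rnorm(10), dim = 10L))

df <- DataFrame(x = x, y = y)  # a DFrame of delayed vectors
df$x                           # column access stays delayed
sub <- df[1:5, ]               # row subsetting is delayed per column
```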
Then, the `SQLDataFrame` would just be a collection of `SqliteColumnVector` objects. No need to reproduce the delayed machinery - let DelayedArray take care of all of that when operations are applied to each column, e.g., row subsetting.

More ambitious developers could even specialize `[` when `x` is a `SQLDataFrame` and `i` is a `SQLColumnVector` to perform some predicate pushdown for greater efficiency. Though this may be a fragile performance boost, as most Bioconductor data structures will probably normalize the subscripts to a regular vector before passing them on to components.
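To make the pushdown idea concrete, here is a sketch in plain DBI/RSQLite terms (table and column names are invented for illustration): a specialized `[` would translate the delayed comparison behind `i` into a WHERE clause, so the filtering happens inside the database engine instead of in R.

```r
library(DBI)
library(RSQLite)

# Toy database standing in for a SQLDataFrame's backing file.
db <- tempfile(fileext = ".sqlite")
con <- dbConnect(SQLite(), db)
dbWriteTable(con, "mytable", data.frame(id = 1:100, score = runif(100)))

# Without pushdown: realize the logical index in R, then subset.
scores <- dbGetQuery(con, "SELECT score FROM mytable")[, 1]
keep <- which(scores > 0.9)

# With pushdown: the database does the filtering, and only the matching
# rows ever leave the file. This is the query a specialized `[` method
# could emit for something like df[df$score > 0.9, ].
hits <- dbGetQuery(con, "SELECT * FROM mytable WHERE score > 0.9")

identical(hits$id, keep)  # same rows either way
dbDisconnect(con)
```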