Open LTLA opened 1 year ago
Hi @LTLA
Thank you for your insights/suggestions, and my apologies for the delayed response. I will be dedicating my focus to the SQLDataFrame project, with the primary goal of adding support for DuckDB and incorporating your valuable suggestions. This work will be my main priority during December and January.
Best, Qian
Problem
Currently, the SQL data frame does not support all aspects of the `DataFrame` API, which makes it difficult to plug and play into existing workflows. For example, if I were to add a new column, it currently fails. (In fact, the messages suggest that it's trying to save the entire data frame into a new SQLite file, which could be very inefficient for large tables.) The same goes for other operations not currently/correctly supported by `SQLDataFrame`.

Proposed solution
After briefly working on https://github.com/LTLA/ParquetDataFrame, I can start to see some common design patterns for out-of-memory `DataFrame` representations. So far, the most useful one has been to define a column vector class based on DelayedArray, where column access returns a delayed vector instead of realizing the column's contents in memory. This is mildly useful for end users as it defers any realization in limited-memory scenarios. However, its true value lies in providing a coherent response when the user does something to a `ParquetDataFrame` that causes it to no longer be consistent with the underlying Parquet file, e.g., adding a new column, replacing a column's contents, binding it with another DF... In such cases, the `ParquetDataFrame` collapses to a `DFrame` of `ParquetColumnVector` objects, providing users with the full `DFrame` functionality while preserving the file-backed nature of each surviving column.

The same philosophy can be applied to `SQLDataFrame`s. For example, if I want to add a new column that wasn't present in the SQL table, we would collapse the `SQLDataFrame` to a `DFrame` of `SQLColumnVector`s plus the new column. This ensures that the full `DataFrame` API is supported, avoids any realization of data from the SQL file, and avoids any writes to the file system. We could also handle all the other operations, e.g., row/colname setting, `cbind`ing, column mutation, and so on, some of which would manifest as DelayedArray-implemented operations on the `SQLColumnVector` objects.

Implementation
So, on that note, I wrote a prototype of a SQL-based `DelayedArray` vector. There are a couple of rough corners in it, but hopefully you get the main idea.
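A minimal sketch of the idea might look like the following. All class, slot, and function names here (`SQLiteColumnSeed`, `SQLiteColumnVector`) are illustrative assumptions rather than final API; the sketch just builds on the standard DelayedArray seed contract of a `dim()` method plus an `extract_array()` method.

```r
library(DBI)
library(RSQLite)
library(DelayedArray)

# A DelayedArray seed representing one column of one SQLite table.
setClass("SQLiteColumnSeed", slots = c(
    path = "character",    # path to the SQLite file
    table = "character",   # table containing the column
    column = "character",  # name of the column
    length = "integer"     # number of rows, cached at construction
))

SQLiteColumnSeed <- function(path, table, column) {
    con <- dbConnect(SQLite(), path)
    on.exit(dbDisconnect(con))
    n <- dbGetQuery(con, sprintf("SELECT COUNT(*) FROM %s", table))[, 1]
    new("SQLiteColumnSeed", path = path, table = table,
        column = column, length = as.integer(n))
}

# 1-dimensional array semantics.
setMethod("dim", "SQLiteColumnSeed", function(x) x@length)

# Realize (part of) the column on demand. A more careful version would
# push the row selection into the SQL query itself instead of subsetting
# in R after the fact.
setMethod("extract_array", "SQLiteColumnSeed", function(x, index) {
    con <- dbConnect(SQLite(), x@path)
    on.exit(dbDisconnect(con))
    vals <- dbGetQuery(con,
        sprintf("SELECT %s FROM %s", x@column, x@table))[, 1]
    i <- index[[1]]
    if (!is.null(i)) vals <- vals[i]
    array(vals, dim = length(vals))
})

# The user-facing delayed vector is just a DelayedArray on this seed.
SQLiteColumnVector <- function(path, table, column) {
    DelayedArray(SQLiteColumnSeed(path, table, column))
}
```

Row subsetting, arithmetic and so on then come for free from the DelayedArray machinery, with the SQLite file only touched when a block is actually realized.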
Usage
This is as simple as:
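For illustration, here is the shape of that usage with ordinary in-memory seeds standing in for the SQL-backed columns (the stand-in is an assumption; real usage would construct the delayed vectors from a SQLite file):

```r
library(S4Vectors)
library(DelayedArray)

# Stand-ins for SQL-backed column vectors: any 1-D DelayedArray will do.
x <- DelayedArray(array(runif(10), dim = 10L))
y <- DelayedArray(array(rnorm(10), dim = 10L))

df <- DataFrame(x = x, y = y)  # a DFrame of delayed vectors
df$x                           # column access stays delayed
sub <- df[1:5, ]               # row subsetting is delayed per column
```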
Then, the `SQLDataFrame` would just be a collection of `SqliteColumnVector` objects. No need to reproduce the delayed machinery - let DelayedArray take care of all of that when operations are applied to each column, e.g., row subsetting.

More ambitious developers could even specialize `[` when `x` is a `SQLDataFrame` and `i` is a `SQLColumnVector` to perform some predicate pushdown for greater efficiency. Though this may be a fragile performance boost, as most Bioconductor data structures will probably normalize the subscripts to a regular vector before passing them on to components.
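To make the pushdown idea concrete, here is a sketch in plain DBI/RSQLite terms (table and column names are invented for illustration): a specialized `[` would translate the delayed comparison behind `i` into a WHERE clause, so the filtering happens inside the database engine instead of in R.

```r
library(DBI)
library(RSQLite)

# Toy database standing in for a SQLDataFrame's backing file.
db <- tempfile(fileext = ".sqlite")
con <- dbConnect(SQLite(), db)
dbWriteTable(con, "mytable", data.frame(id = 1:100, score = runif(100)))

# Without pushdown: realize the logical index in R, then subset.
scores <- dbGetQuery(con, "SELECT score FROM mytable")[, 1]
keep <- which(scores > 0.9)

# With pushdown: the database does the filtering, and only the matching
# rows ever leave the file. This is the query a specialized `[` method
# could emit for something like df[df$score > 0.9, ].
hits <- dbGetQuery(con, "SELECT * FROM mytable WHERE score > 0.9")

identical(hits$id, keep)  # same rows either way
dbDisconnect(con)
```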