Bioconductor / SQLDataFrame

DataFrame representation of SQL database table
GNU Lesser General Public License v3.0
2 stars 3 forks source link

Make SQLDataFrame a concrete subclass of virtual class DataFrame #5

Closed hpages closed 2 years ago

hpages commented 2 years ago

See issue #4 for a discussion of this change.

Starting with S4Vectors 0.33.3, DataFrame is a virtual class with no listData, nrows, or rownames slot, making it possible for SQLDataFrame to become a concrete subclass of DataFrame.

The ultimate goal is to be able to use an SQLDataFrame object in any place where a DataFrame derivative is expected. For example inside a Vector derivative to store the metadata columns, or inside a SummarizedExperiment derivative to store the colData. With SQLDataFrame 1.9.1, this is now possible but with a big gotcha:

library(SQLDataFrame)
library(DBI)
dbfile <- system.file("extdata/test.db", package = "SQLDataFrame")
conn <- DBI::dbConnect(DBI::dbDriver("SQLite"), dbname = dbfile)
obj <- SQLDataFrame(conn = conn, dbtable = "state", dbkey = "state")
dim(obj)
# [1] 50  4
is(obj, "DataFrame")
# [1] TRUE

library(GenomicRanges)
gr <- GRanges("chr1", IRanges(1:50, 60))
mcols(gr) <- obj
gr
# GRanges object with 50 ranges and 4 metadata columns:
#        seqnames    ranges strand |           division        region population        size
#           <Rle> <IRanges>  <Rle> |        <character>   <character>  <numeric> <character>
#    [1]     chr1      1-60      * | East South Central         South       3615      medium
#    [2]     chr1      2-60      * |            Pacific          West        365       small
#    [3]     chr1      3-60      * |           Mountain          West       2280      medium
#    [4]     chr1      4-60      * | West South Central         South       2110      medium
#    [5]     chr1      5-60      * |            Pacific          West      21198       large
#    ...      ...       ...    ... .                ...           ...        ...         ...
#   [46]     chr1     46-60      * |     South Atlantic         South       4981      medium
#   [47]     chr1     47-60      * |            Pacific          West       3559      medium
#   [48]     chr1     48-60      * |     South Atlantic         South       1799      medium
#   [49]     chr1     49-60      * | East North Central North Central       4589      medium
#   [50]     chr1     50-60      * |           Mountain          West        376       small
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Looks good... BUT:

mcols(gr)
# Error in `dimnames<-`(`*tmp*`, value = dn) : 
#   invalid to use dimnames()<- on an S4 object of class 'SQLDataFrame'

Basically, this fails because SQLDataFrame objects don't support rownames<-:

rownames(obj) <- NULL
# Error in `dimnames<-`(`*tmp*`, value = dn) : 
#   invalid to use dimnames()<- on an S4 object of class 'SQLDataFrame'

Supporting rownames<- would go a long way in making SQLDataFrame objects drop-in replacements for DFrame objects.

H.

Liubuntu commented 2 years ago

HI @hpages Thanks for all the changes! I'll go through the package to refresh my memories, and see if anything else (except for the rownames) needed to be added (I've had a reimplement branch earlier so maybe adding more changes based on that).