OHDSI / Andromeda

AsynchroNous Disk-based Representation of MassivE DAta: An R package aimed at replacing ff for storing large data objects.
https://ohdsi.github.io/Andromeda/

Generics for tbl_Andromeda #62

Open mvankessel-EMC opened 5 months ago

mvankessel-EMC commented 5 months ago

I'd like to propose some generics for tbl_Andromeda objects, mostly for quality of life. In TreatmentPatterns I often need the number of rows of a (filtered) table. Currently, calling nrow(andromeda$iris) on a tbl_Andromeda table returns NA.

I currently do this:

library(Andromeda)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dplyr)

a <- andromeda()

a$iris <- iris

a$iris %>%
  summarise(n = n()) %>%
  pull(n)
#> [1] 150

I set up some examples of generics we could implement for Andromeda tables:

nrow()

a$iris <- iris

nrow(a$iris)
#> [1] NA

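# nrow() is not an S3 generic in base R, so define one
nrow <- function(x) {
  UseMethod("nrow")
}

# Fall back to base::nrow() so other classes (e.g. data.frame) keep working
nrow.default <- function(x) {
  base::nrow(x)
}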

nrow.tbl_Andromeda <- function(x) {
  x %>%
    summarise(n = n()) %>%
    pull(n)
}

nrow(a$iris)
#> [1] 150
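
A possible alternative, sketched under assumptions: base::nrow() and base::ncol() are thin wrappers around dim(), which is an internal generic, so providing a dim() method would make both work without masking base::nrow(). The sketch below uses a hypothetical stand-in class, tbl_mock (not part of Andromeda), wrapping an in-memory data frame. For tbl_Andromeda the row count would have to come from a query such as the summarise(n = n()) call above, and whether such a method would take precedence over existing dbplyr methods is not tested here.

```r
# Hypothetical stand-in for a lazy table; tbl_mock is NOT an Andromeda class
mock <- structure(list(data = iris), class = "tbl_mock")

# dim() is an internal generic, so an S3 method is enough
dim.tbl_mock <- function(x) {
  # A real tbl_Andromeda method would run a row-count query here instead
  dim(x$data)
}

# nrow() and ncol() call dim() internally, so they pick up the method
nrow(mock)
#> [1] 150
ncol(mock)
#> [1] 5
```

With this approach nrow() itself is left untouched, so code that calls nrow() on ordinary data frames is unaffected.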

length()

length(iris)
#> [1] 5
length(a$iris)
#> [1] 2

length.tbl_Andromeda <- function(x) {
  ncol(x)
}

length(a$iris)
#> [1] 5

str()

# Match the str(object, ...) generic signature and forward extra arguments
str.tbl_Andromeda <- function(object, ...) {
  object %>%
    head() %>%
    dplyr::collect() %>%
    str(...)
}

str(a$iris)
#> tibble [6 × 5] (S3: tbl_df/tbl/data.frame)
#>  $ Sepal.Length: num [1:6] 5.1 4.9 4.7 4.6 5 5.4
#>  $ Sepal.Width : num [1:6] 3.5 3 3.2 3.1 3.6 3.9
#>  $ Petal.Length: num [1:6] 1.4 1.4 1.3 1.5 1.4 1.7
#>  $ Petal.Width : num [1:6] 0.2 0.2 0.2 0.2 0.2 0.4
#>  $ Species     : chr [1:6] "setosa" "setosa" "setosa" "setosa" ...

`[`()

This one is probably overkill

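`[.tbl_Andromeda` <- function(x, i, j) {
  # Apply each step only when its index is supplied, so x[i, ] and x[, j] work
  if (!missing(j)) x <- select(x, all_of(j))
  # !!i inlines the local value, even if the table has a column named i
  if (!missing(i)) x <- filter(x, row_number() %in% !!i)
  x
}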

iris[c(1,2,3), c(1,2)]
#>   Sepal.Length Sepal.Width
#> 1          5.1         3.5
#> 2          4.9         3.0
#> 3          4.7         3.2
a$iris[c(1,2,3), c(1,2)]
#> # Source:   SQL [3 x 2]
#> # Database: sqlite 3.41.2 [C:\Users\MVANKE~1\AppData\Local\Temp\RtmpqwwhxC\file86f420a8698e.sqlite]
#>   Sepal.Length Sepal.Width
#>          <dbl>       <dbl>
#> 1          5.1         3.5
#> 2          4.9         3  
#> 3          4.7         3.2

There are probably more generics that would be useful.