fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format
GNU Affero General Public License v3.0
27 stars 4 forks source link

Move state-update ability from `datatableinterface` into `table_proxy` #23

Open martinblostein opened 6 years ago

martinblostein commented 6 years ago

Currently the datatableinterface inherits from data.table, and stores its table_proxy object in a data.table cell. This allows the table_proxy object to be updated in-place. I believe this functionality should be moved into the table_proxy object because (1) it is not specific to the data.table interface and (2) any function intended update the state of a table_proxy needs to take the entire datatableinterface object as an argument, which seems a bit unwieldy.

The new arrangement would allow:

table_proxy_update_state(tbl_proxy, new_state) {
   tbl_proxy[, remote_table_state := new_state]
}

Or, assuming that the remote_table_state field it is the only part of the table_proxy that ever needs to be updated, that field itself could inherit from data.table, which would allow functions along these lines:

table_proxy_select_cols(tbl_proxy, selection) {
  tbl_proxy$remote_table_state[, colnames := colnames[selection]]
  tbl_proxy$remote_table_state[, ncol := length(selection)]
  tbl_proxy$remote_table_state[, coltypes := coltypes[selection]]
}

If this makes sense, I can have a go at implementing it.

MarcusKlik commented 6 years ago

Thanks, it definitely makes sense for the table_proxy object to be able to update it's own meta-data! For example, to facilitate := updates from the data.table interface.

When the table_proxy class and the remote_table generics go into their own package at some time in the future, it's probably best not to let that package depend on data.table. So we would need to have a native mechanism for keeping a self-reference.

Perhaps we can study the data.table package for that later mechanism later but use a data.table container for now. In short, yes please have a go :-) !

martinblostein commented 6 years ago

If we need self-reference and mutability, then I think reference classes might be the most straightforward answer. They are part of base R. If remote_table_state was an instance of RefClass RemoteTableState, we could have something like this:

RemoteTableState <- setRefClass(
  'RemoteTableState',

  fields = list(
    colnames = 'character',
    coltypes = 'character',
    ncol = 'integer'
  ),

  methods = list(
    initialize = function(colnames, coltypes) {
      .self$colnames <- colnames
      .self$coltypes <- coltypes
      .self$ncol <- length(colnames)
    },

    show = function() {
      cat('Remote Table State:\n')
      cat('Column Names:', .self$colnames, '\n')
      cat('Column Types:', .self$coltypes, '\n')
    },

    select_cols = function(selection) {
      .self$colnames <- .self$colnames[selection]
      .self$coltypes <- .self$coltypes[selection]
      .self$ncol     <- length(selection)
    }
  )
)

remote_table_state <- RemoteTableState(c('a', 'b'), c('int', 'int'))

print(remote_table_state)
# Remote Table State:
# Column Names: a b 
# Column Types: int int 

remote_table_state$select_cols(1)

print(remote_table_state)
# Remote Table State:
# Column Names: a 
# Column Types: int 
martinblostein commented 6 years ago

The reference classes aren't the most efficient, but I don't think this matters because we won't be keeping anything large in the remote state object, which is just part of the table proxy... Right?

MarcusKlik commented 6 years ago

Hi @martinblostein, thanks for that! Indeed, reference classes would solve the self-reference problem and provide for clear and encapsulated code. But it might be relatively slow. Perhaps that won't matter too much as you say, as we would only use it to provide a container for the meta-data in remote_table_state so that it can be updated from within the table_proxy methods.

On the other hand, using reference classes just to accommodate the := operator for the data.table interface (dplyr has no updates by reference) might be overkill. Perhaps using a special function would be enough:

# change list element by reference
set_element <- Rcpp::cppFunction("
  void set_element(SEXP list, SEXP element, int element_pos)
  {
    SET_VECTOR_ELT(list, element_pos, element);
  }
")

# some class with metadata
some_class <- function() {
  x <- list(
    some_data = 'mutable data',
    some_other_data = TRUE)
  class(x) <- "some_class"
  x
}

# method for updating the 'some data' field by reference
alter_data <- function(instance_of_some_class) {
  set_element(instance_of_some_class, 'altered', 0)
}

# create an instance of 'some_class'
instance <- some_class()
print(instance)
#> $some_data
#> [1] "mutable data"
#> 
#> $some_other_data
#> [1] TRUE
#> 
#> attr(,"class")
#> [1] "some_class"

# get memory locations of the R objects
pryr::inspect(instance)
#> <VECSXP 0x191ef690>
#>   <STRSXP 0x144ed600>
#>     <CHARSXP 0x18c266d8>
#>   <LGLSXP 0x144ed630>
#> ..some data removed here...

# alter the meta-data by reference
alter_data(instance)

# verify that no copy was made but the field address has changed
pryr::inspect(instance)
#> <VECSXP 0x191ef690>
#>   <STRSXP 0x144ed7e0>
#>     <CHARSXP 0x144e4d58>
#>   <LGLSXP 0x144ed630>
#> ..some data removed here...

# 'some_data' field has changed
print(instance)
#> $some_data
#> [1] "altered"
#> 
#> $some_other_data
#> [1] TRUE
#> 
#> attr(,"class")
#> [1] "some_class"

So the set_element method changes the list element by reference. I think this will be save to use, but we would have to investigate further to know for sure. If we know what we're doing, a set_element method would be a very fast method of updating the meta-data. But when the speed of reference classes proves to be of no concern, the reference class option would be a cleaner solution , what do you think?

MarcusKlik commented 6 years ago

Interesting side-note: don't use pryr::inspect() on an instance of a reference class, I just found that it get's into an infinite loop :-)

martinblostein commented 6 years ago

I like it! It allows us to carry on with whatever structure we like, without sacrificing the ability to deal with interfaces that use updates by reference. (I'm not crazy about mixing two object systems.)

However, I think I was a bit confused about the purpose of this update-by-reference functionality. If it is just for data.table :=, then perhaps it does make sense to keep it in the data.table interface?

MarcusKlik commented 6 years ago

Now that I'm thinking about it, perhaps it wouldn't be just for the data.table interface. We could split more complex queries into multiple smaller pieces for convenience. That would make the calls to the table_proxy simpler, e.g.:

ft2 <- ft[Year > 1960, .(A, B, C = 3 * D)]

could be split into:

Those two steps could be separate calls to the table_proxy. It would be nice to update only a small part of the remote_table_state in each step to avoid unnecessary copies of the whole structure. Also, when virtual columns are generated in multiple steps, the remote_table_state could be a growing structure.

Perhaps we can use the data.table solution for now (we have a data.table dependency already anyway) and put the set_element() solution up for a future enhancement (in a separate issue)?

martinblostein commented 6 years ago

Those two steps could be separate calls to the table_proxy. It would be nice to update only a small part of the remote_table_state in each step to avoid unnecessary copies of the whole structure. Also, when virtual columns are generated in multiple steps, the remote_table_state could be a growing structure. Perhaps we can use the data.table solution for now (we have a data.table dependency already anyway) and put the set_element() solution up for a future enhancement (in a separate issue)?

Sounds good to me!

(For completeness, another option that avoids both reference classes and R internals is to use plain environments, which are mutable and can be updated by reference. Reference classes are implemented using environments, but according to this page, plain environments are quite efficient.)

MarcusKlik commented 6 years ago

Yes, thanks, that might be an option too!

The solution that's simplest and easiest to maintain would be the best I guess because as you said earlier, the performance won't (most probably) be much of a problem because it's just a wrapper (and the performance critical parts won't depend on the specific choice of wrapper).