fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Serialization of list columns #20

Open derekholmes opened 7 years ago

derekholmes commented 7 years ago

This is a great package: it cut my data loading time in half, though with some effort. I frequently group data into lists (e.g. a time series "dataset" with the data in a data.table, the inventory in a small data.frame, and xts dates/representations) of the form mydata=list("x"=data.table(..), "y"=data.table, "z" = chr), etc.

I was able to write a wrapper around these that splits the component data.tables into separate .fst files, but it would be great if you generalized read and write to more general data structures. Eventually, I think this could really be a replacement for save and load.

MarcusKlik commented 7 years ago

Thanks, it's good to see that many people benefit from the fst package in their work! Your request is somewhat similar to the requests made in issue #12. I believe a first step could be to use R's internal serialization mechanism for serializing 'complex types', but with the LZ4 and ZSTD compressors instead of the default compressors for speed. In that case, you would still have random row access to elements in list-type columns. Later, I could optimize further by using fst serialization for list elements of known types inside the list-type columns (recursively), increasing speed further. Thanks for the request, it's definitely on the list for one of the next versions.
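
The first step described above can be approximated in user code. The following is only a rough sketch, assuming an fst version that exports the in-memory compressors `compress_fst()` and `decompress_fst()`. The helper names `pack_object` and `unpack_object` are hypothetical, and this is not a description of how fst stores data internally. Note that it compresses the object as a whole, so it does not provide the random row access mentioned above.

```r
library(fst)

# Hypothetical helper: serialize any R object with base R, then compress
# the resulting raw vector with ZSTD via fst's in-memory compressor.
pack_object <- function(x, compression = 50) {
  raw_bytes <- serialize(x, connection = NULL)
  compress_fst(raw_bytes, compressor = "ZSTD", compression = compression)
}

# Hypothetical helper: reverse of pack_object().
unpack_object <- function(packed) {
  unserialize(decompress_fst(packed))
}

# A list of mixed components, similar in shape to the one in the request.
mydata <- list(
  x = data.frame(t = 1:5, value = c(0.1, 0.2, 0.3, 0.4, 0.5)),
  y = data.frame(id = c("a", "b"), qty = c(10L, 20L)),
  z = "some label"
)

packed   <- pack_object(mydata)
restored <- unpack_object(packed)
```

The packed raw vector could then be written to disk with `writeBin()` and read back with `readBin()`.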

derekholmes commented 6 years ago

FWIW, here is the code I wrote to do this. I have a function called cAssign() which takes a possible list of dataframes and assigns them to a different environment and/or saves them to disk. (This is self-rolled persistence.) The name is passed in as a string, and if one of the data frames is large enough, it is split into a separate file using fst.

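library(fst)

# Save each object named in x to <dpath><name>.RD. With usefst=TRUE, any data
# frame with at least nbig rows inside a list is written to its own .fst file
# and replaced in the saved list by that file's name.
cAssign<-function(x,dbg=TRUE,silent=FALSE,copysilent=FALSE,trace=FALSE,dpath=datapath,nbig=10000,title="",usefst=TRUE) {
   caller=parent.frame()   # environment in which the named objects are looked up
   ppp=lapply(x,function(y){
      fname=paste0(dpath,y,".RD")
      if(usefst) {
        cadtmp=get(y,envir=caller)
        if(inherits(cadtmp,"list")) {
          # Use the element names, falling back to A1, A2, ... where a name is missing
          nm=names(cadtmp)
          if(is.null(nm)) nm=rep("",length(cadtmp))
          listonames=ifelse(nzchar(nm),nm,paste0("A",seq_along(cadtmp)))
          for(i in seq_along(listonames)) {
            if(is.data.frame(cadtmp[[i]]) && nrow(cadtmp[[i]])>=nbig) {
               message(" Splitting ",listonames[[i]], " from ",y)
               newfilename=paste0(y,"_",listonames[[i]],".fst")
               write_fst(cadtmp[[i]], paste0(dpath,newfilename),compress=20)
               cadtmp[[i]]=newfilename   # keep only the file name in the list
            }
          }
        }
        # Save the (possibly slimmed-down) object under its original name
        e1<-new.env()
        assign(y,cadtmp,envir=e1)
        save(list=y,envir=e1,file=fname)
      } else {
        save(list=y,envir=caller,file=fname)
      }
      if(!copysilent) { message("GlobalAssign and Saving ", y, " to disk as ",y,".RD (filesize:",file.size(fname),")") }
   })
}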
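
The function above only covers the write side. A possible read-side counterpart could look like the sketch below. The name `cLoad` is hypothetical, and it assumes the file layout that cAssign() produces (a `<name>.RD` file in which large data frames have been replaced by the name of their .fst file) and an fst version that exports `read_fst()` and `write_fst()`. One caveat of this layout: a genuine character element that happens to end in ".fst" cannot be told apart from a placeholder.

```r
library(fst)

# Hypothetical counterpart to cAssign(): load <dpath><name>.RD, then replace
# every list element that is a single string ending in ".fst" with the data
# frame stored in that file.
cLoad <- function(name, dpath) {
  e1 <- new.env()
  load(paste0(dpath, name, ".RD"), envir = e1)
  obj <- get(name, envir = e1)
  if (inherits(obj, "list")) {
    for (i in seq_along(obj)) {
      el <- obj[[i]]
      if (is.character(el) && length(el) == 1 && grepl("\\.fst$", el)) {
        obj[[i]] <- read_fst(paste0(dpath, el))
      }
    }
  }
  obj
}

# Round trip in a temporary directory, mimicking the files cAssign() writes.
dpath <- paste0(tempdir(), "/")
big <- data.frame(t = 1:3, v = c(1.5, 2.5, 3.5))
write_fst(big, paste0(dpath, "ds_x.fst"))
ds <- list(x = "ds_x.fst", z = "label")
save(list = "ds", envir = environment(), file = paste0(dpath, "ds.RD"))

restored <- cLoad("ds", dpath)
```
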
MarcusKlik commented 6 years ago

Hi @derekholmes, thanks for sharing! So basically you need to store a list with several components, the largest of which are data.tables. You would like fst to be able to store a list and, if a list element is a data.table (or vector), still have random access to that structure?

Supporting lists would certainly be possible. For storing a table with random access inside that list, the fst format would need to support nested structures. That would be a very interesting and useful feature, I think. The current format could be maintained as is, but when you need a list, you can use a data.table with a single column of the list type. The same holds for vectors.
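
To illustrate the shape of such a table, here is a small sketch in base R; a data.table with a list column would look much the same. The object names are made up for the example. Whether fst can actually write a table like this depends on the fst version; at the time of this thread the list type is not yet implemented.

```r
# A list of mixed components, similar in shape to the one in the request.
mydata <- list(
  x = data.frame(t = 1:3, value = c(0.5, 1.5, 2.5)),
  y = data.frame(id = c("a", "b")),
  z = "some label"
)

# A one-column table whose only column is a list: each row holds one element
# of the original list. I() keeps data.frame() from spreading the list
# elements over separate columns.
wrapped <- data.frame(element = I(mydata))

# Row i of the table corresponds to list element i.
third <- wrapped$element[[3]]
```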

The speed of a nested list structure would probably be lower due to additional file-pointer jumps, but when the data.table elements are comparatively large, the effect would be small.

Thanks for your feature request. When the list type is implemented, I'll make sure that the format is prepared for recursive structures as well!