Open derekholmes opened 7 years ago
Thanks, it's good to see that many people benefit from the fst package in their work! Your request is somewhat similar to the requests made in issue #12. I believe a first step could be to use R's internal serialization mechanism for serializing 'complex types', but with the LZ4 and ZSTD compressors instead of the default compressors for speed. In that case, you would still have random row access to elements in list-type columns. Later, I could optimize further by using fst serialization for list elements of known types inside the list-type columns (recursively), increasing speed further.
Thanks for the request, it's definitely on the list for one of the next versions.
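For readers who want to try the "R serialization plus a faster compressor" idea by hand, here is a minimal sketch. It is not how fst stores list columns; it only serializes a whole object with base R's serialize() and compresses the resulting raw vector, so it gives no random row access. The helper names pack_object() and unpack_object() are made up for illustration. The sketch assumes fst's compress_fst() and decompress_fst() are available, and falls back to base R's memCompress() with gzip (which is neither LZ4 nor ZSTD) when fst is not installed.

```r
# Sketch only: serialize an arbitrary R object, then compress the raw bytes.
# pack_object()/unpack_object() are hypothetical helpers, not part of fst.
has_fst <- requireNamespace("fst", quietly = TRUE)

pack_object <- function(x, compressor = "ZSTD") {
  raw_bytes <- serialize(x, connection = NULL)  # base R serialization to a raw vector
  if (has_fst) {
    # compressor may be "ZSTD" or "LZ4"; compression ranges from 0 to 100
    fst::compress_fst(raw_bytes, compressor = compressor, compression = 50)
  } else {
    # Fallback so the example still runs without fst: gzip, not LZ4/ZSTD
    memCompress(raw_bytes, type = "gzip")
  }
}

unpack_object <- function(packed) {
  raw_bytes <- if (has_fst) {
    fst::decompress_fst(packed)
  } else {
    memDecompress(packed, type = "gzip")
  }
  unserialize(raw_bytes)
}

# A list mixing a table and a character element, as in the request
mydata <- list(x = data.frame(a = 1:5, b = letters[1:5]), z = "some text")
packed <- pack_object(mydata)
restored <- unpack_object(packed)
stopifnot(identical(restored, mydata))
```

The packed raw vector can be written to disk with writeBin() and read back with readBin() if a file is needed.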
FWIW, here is the code I wrote to do this. I have a function called cAssign() which takes a possible list of dataframes and either assigns them to a different environment and/or saves them to disk. (This is self-rolled persistence.) The name is passed in as a string, and if one of the data frames is large enough, it is split into a separate file using fst.
cAssign <- function(x, dbg=TRUE, silent=FALSE, copysilent=FALSE, trace=FALSE, dpath=datapath, nbig=10000, title="", usefst=TRUE) {
  # x: character vector of object names; each object is saved as <dpath><name>.RD
  ppp = lapply(x, function(y) {
    fname = paste0(dpath, paste0(y, ".RD"))
    if (usefst) {
      cadtmp = get(y, pos=parent.frame(n=3))
      if ("list" %in% class(cadtmp)) {
        # Use the list names, falling back to A1, A2, ... for an unnamed list
        listonames = c(names(cadtmp), paste0("A", seq_along(cadtmp)))[seq_along(cadtmp)]
        for (i in seq_along(listonames)) {
          # Split data frames with at least nbig rows into their own .fst file
          if ("data.frame" %in% class(cadtmp[[i]]) && nrow(cadtmp[[i]]) >= nbig) {
            message(" Splitting ", listonames[[i]], " from ", y)
            newfilename = paste0(y, "_", listonames[[i]], ".fst")
            write.fst(cadtmp[[i]], paste0(dpath, newfilename), compress=20)
            # Replace the data frame with the name of the file that now holds it
            cadtmp[[i]] = newfilename
          }
        }
      }
      # Save the (possibly slimmed-down) object under its original name
      e1 <- new.env()
      assign(y, cadtmp, envir=e1)
      save(list=c(y), envir=e1, file=fname)
    } else {
      save(list=c(y), file=fname)
    }
    if (!copysilent) { message("GlobalAssign and Saving ", y, " to disk as ", y, ".RD (filesize:", file.size(fname), ")") }
  })
}
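The post shows only the writing side. A reader for these files might look like the sketch below. cLoad() is a hypothetical name and is not from the original post. It assumes the layout cAssign() produces: an .RD file holding the object, with each split-out data frame replaced by the name of its .fst file in the same directory. The reader argument is also an assumption, added so that the file-reading function can be swapped; it defaults to fst's read_fst().

```r
# Hypothetical counterpart to cAssign(); a sketch, not the author's code.
cLoad <- function(name, dpath, reader = fst::read_fst) {
  e1 <- new.env()
  load(paste0(dpath, name, ".RD"), envir = e1)
  obj <- get(name, envir = e1)
  if (is.list(obj) && !is.data.frame(obj)) {
    for (i in seq_along(obj)) {
      el <- obj[[i]]
      # Treat an element as a placeholder only if it is a single string that
      # ends in ".fst" and names an existing file under dpath
      is_placeholder <- is.character(el) && length(el) == 1 &&
        grepl("\\.fst$", el) && file.exists(paste0(dpath, el))
      if (is_placeholder) {
        obj[[i]] <- reader(paste0(dpath, el))
      }
    }
  }
  obj
}
```

With the directory used when saving, a call such as mydata <- cLoad("mydata", datapath) would return the list with its large tables restored. Note that a genuine character element that happens to match an existing .fst file name would also be replaced, which is one reason native list support in the format would be cleaner.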
Hi @derekholmes, thanks for sharing! So basically you need to store a list with several components, the largest of which are data.tables. You would like fst to be able to store a list and, if a list element is a data.table (or vector), still have random access to that structure?
Supporting lists would certainly be possible. For storing a table with random access inside that list, the fst format would need to support nested structures. That would be a very interesting and useful feature, I think. The current format could be maintained as is, but when you need a list, you can use a single-column data.table containing one column of the list type. The same holds for vectors.
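To make the "single column of the list type" shape concrete, here is a small in-memory sketch. It uses a base data.frame rather than a data.table so that it runs without extra packages, and it does not involve fst at all; the names elements and wrapped are illustrative only.

```r
# A list with a table, a numeric vector and a character element
elements <- list(x = data.frame(a = 1:3), y = c(1.5, 2.5), z = "label")

# Wrap the list as a table with one list-type column: one row per list element.
# I() keeps the list as a single column instead of spreading it over columns.
wrapped <- data.frame(value = I(unname(elements)))

stopifnot(nrow(wrapped) == length(elements))

# Each row holds one element, so reading a row recovers that element
first <- wrapped$value[[1]]
stopifnot(identical(first, elements$x))
```

In this shape, row access to the table corresponds to access to a single list element, which is the property the reply describes.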
The speed of a nested list structure would probably be lower due to additional file-pointer jumps, but when the data.table elements are comparatively large, the effect would be small.
Thanks for your feature request. When the list type is implemented, I'll make sure that the format is prepared for recursive structures as well!
This is a great package: it halved my data loading time, though with some effort. I frequently group data into lists (e.g. a time series "dataset" with data in a data.table, inventory in a small data.frame and xts dates/representations) of the form mydata=list("x"=data.table(..), "y"=data.table, "z" = chr) etc.
I was able to write a wrapper around these that writes the component data.tables to separate .fst files, but it would be great if you generalized read and write to more general data structures. Eventually, I think this could really be a replacement for save and load.