fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format
GNU Affero General Public License v3.0

rbind/cbind two fsttable objects or a fsttable with a data.frame or data.table #43

Open akrun1 opened 4 years ago

akrun1 commented 4 years ago

I would like to rbind two fsttable objects, or a single fsttable with a data.frame. What would be the preferred method?

library(fsttable)
library(data.table)
ft1 <- fst_table('1.fst')
rbindlist(list(ft1, ft1[1:10]))
  .table_proxy X Y
  1: <tableproxy[2]> 0 0
  2: <tableproxy[2]> 0 0

To create or update a column, I tried:

ft1[1:4, .(X)] * 4
    X
1:  4
2:  8
3: 12
4: 16

If I update using data.table syntax, it results in an error:

new <- (ft1[1:4, .(X)] * 4)[[1]]
ft1[1:4, new := new]
Error in parse_j(j, tbl_proxy$remotetablestate$colnames, parent.frame()) : 
  j must be a list

Is there a preferred method for modifying/updating columns? I read some previous issues here and here, and I wonder whether there have been any updates since. Thanks.

PS: My objective is to update an already loaded fsttable object without converting it to a data.frame/data.table, add new rows, and write it back as a .fst file (after doing some join operations).

akrun1 commented 4 years ago

I also compared read efficiency and select/subset performance between fsttable and tidyft. Both read the dataset (.fst, 10328208 x 35) very efficiently, but the later steps are costly in tidyft. If there are ways to do this efficiently in fsttable, that would be great.

MarcusKlik commented 4 years ago

Hi @akrun1, thanks for your feature request!

Unfortunately, fsttable does not have rbindlist or cbind functionality at the moment, as it is still in its first experimental stages (and not actively developed right now). But it would certainly be a requirement for a fully functional data.table interface.

Thanks, I'll add your issue as a feature request!
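
Until something like that exists in fsttable, one possible workaround is to fall back on the fst package itself. The following is only a rough sketch and not fsttable functionality: the file paths and the toy data are hypothetical stand-ins for '1.fst', and it does pull the full table into memory for the rbind step, so it does not meet the goal of staying entirely on disk. It uses `read_fst()` with `from`/`to` to load just a row slice, data.table's `:=` on the in-memory slice, `rbindlist()` to append, and `write_fst()` to store the result.

```r
library(fst)
library(data.table)

# Hypothetical paths and toy data standing in for '1.fst'
path_in  <- tempfile(fileext = ".fst")
path_out <- tempfile(fileext = ".fst")
write_fst(data.table(X = 1:10, Y = (1:10) * 2), path_in)

# Read only rows 1 to 4 from disk, directly as a data.table
head_rows <- read_fst(path_in, from = 1, to = 4, as.data.table = TRUE)

# `:=` works here because head_rows is an ordinary in-memory data.table
head_rows[, new := X * 4]

# Appending rows needs the full table in memory (the costly step)
full <- read_fst(path_in, as.data.table = TRUE)
combined <- rbindlist(list(full, head_rows), fill = TRUE)

# Write the combined table back to a .fst file
write_fst(combined, path_out)
```

With `fill = TRUE`, rows coming from the original table get `NA` in the `new` column, since that column only exists in the slice.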

akrun1 commented 4 years ago

@MarcusKlik Thank you for the reply. I tried some other packages (tidyft, arrow and disk.frame). One of the main advantages of your package fsttable is that it is so fast at slicing. With tidyft, as soon as I use select_fst and do some operations, it loses that advantage because it pulls the data into memory. With disk.frame, I split the data into multiple CSV files, but it still takes a lot of time to read the data and put it into .fst files.