Open hope-data-science opened 4 years ago
Hi @hope-data-science , thanks for the feature request!
Having more options to manipulate and view characteristics of the offline dataset would indeed be very useful. But those can be better served by separate R packages that import fst for the low-level operations (such as the fstplyr, fsttable or your tidyfst packages).
So fst can provide the lower-level operations and access to metadata, while downstream packages build on those to offer functionality through their own specific APIs. Does that sound reasonable?
For example, fst can provide the following low-level abilities:
R
operations on each group
Downstream packages could use these features to support their own APIs and provide functionality like offline sorting, partial loading, etc.
I am not so familiar with the underlying implementation; what you call "low-level abilities" are actually quite "high-level" to me. If these abilities could be implemented in fst in a faster and more memory-efficient way, I think that would be amazing! To start with, my expectations are just:
How to access data more efficiently from an fst file? How to subset data more flexibly (by group? filter? slice? select? [I think I've handled this part in some way])?
I did make a function named filter_fst, but that might not be fast. I think fst could facilitate the access part very well. As for the computation part, if that could really be brought to us, it would be a revolution! I think it would open a new era of out-of-memory computation, especially for some tough tasks.
BTW, a small problem: I am trying to get the zero-row version of an fst table but failed. With a data.frame or data.table, you can use DT[0,] to get the column names and classes, which facilitates selection. Maybe an fst table could do that too? Currently I use ft[1,][0,] for that; it works, but it is a little verbose, and if there are lots of columns it may take some time. Is it possible to make ft[0,] work?
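As a rough sketch of the workaround described above, the snippet below builds a zero-row prototype from an fst file by reading a single row and dropping it, and also shows that the column names alone can be obtained from the metadata without reading any rows. The sample data and variable names are made up for illustration.

```r
library(fst)

# Hypothetical sample data written to a temporary fst file
path <- tempfile(fileext = ".fst")
x <- data.frame(X = 1:10, Y = letters[1:10], stringsAsFactors = FALSE)
write_fst(x, path)

ft <- fst(path)

# Workaround from this thread: read one row, then drop it
proto <- ft[1, ][0, ]

# Same idea using read_fst directly: read only row 1, then drop it
proto2 <- read_fst(path, from = 1, to = 1)[0, ]

# Column names only, taken from the file metadata without reading rows
col_names <- metadata_fst(path)$columnNames
```

Both prototypes keep the column names and classes of the stored table, which is what `DT[0, ]` gives for an in-memory table.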
Thanks!
Hi @hope-data-science, you're right, ft[0, ] should definitely have an output identical to DT[0, ]. Using the example code above:
# identical
x[1, ]
#> X Y
#> 1 1 2
fst_table[1, ]
#> X Y
#> 1 1 2
# not identical
x[0, ]
#> [1] X Y
#> <0 rows> (or 0-length row.names)
fst_table[0, ]
#> Error in read_fst(meta_info$path, from = min_row, to = max_row): Parameter 'from' should have a numerical value equal or larger than 1.
Thanks for pointing that out, I'll schedule a fix for the next release!
added as a separate issue
I've designed a new tool to work with fst, which is considered to be more memory efficient. Link: https://hope-data-science.github.io/tidyft/articles/Introduction.html
I find fst_table a very useful class: you do not have to physically read the file, but you still get enough information to know how to process it. Perhaps there could be more methods for it, e.g. is.fst.table, path.fst.table, summary.fst.table, etc. I think this is going to be popular for big data analysis in R.