fst_table as a serious class

hope-data-science commented 4 years ago

I find fst_table a very useful class: you do not have to read the file into memory, yet it gives enough information to decide how to process the data. Perhaps it could have more methods, e.g. is.fst.table, path.fst.table, summary.fst.table, etc. I think this is going to be popular for big data analysis in R.

MarcusKlik commented 4 years ago

Hi @hope-data-science , thanks for the feature request!

Having more options to manipulate and view characteristics of the offline dataset would be very useful indeed. But those can be better served in separate R packages that import fst for the low-level operations (such as the fstplyr, fsttable or your tidyfst packages).

So fst can provide the lower-level operations and access to metadata, while downstream packages build on them to offer features through their own specific APIs. Does that sound reasonable?

For example, fst can provide the following low-level abilities:

read from file using custom (random) row-filters
read from file using a custom ordering
read from file using group-windows (in the background) and apply custom R operations on each group
read from file and sort the result while reading (on background threads)
join two fst files using (sorted) keys

Downstream packages could use these features to support their own APIs and provide functionality like offline sorting, partial loading, etc.
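For context, some partial-read primitives are already available in fst today. The following is a minimal sketch, assuming the fst package is installed; the temporary file and the column names X and Y are made up for illustration, and the metadata field names are as I understand them from the package rather than anything stated in this thread.

```r
library(fst)

# Illustrative data written to a temporary file (names are made up)
path <- tempfile(fileext = ".fst")
write_fst(data.frame(X = 1:100, Y = rnorm(100)), path)

# Inspect the file without loading its data
meta <- metadata_fst(path)
meta$columnNames  # column names stored in the file
meta$nrOfRows     # number of rows stored in the file

# Read only column X, and only rows 11 to 20
part <- read_fst(path, columns = "X", from = 11, to = 20)
```

A downstream package could wrap calls like these in its own verbs without fst itself needing a richer API.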

hope-data-science commented 4 years ago

I am not so familiar with the underlying implementation; what you call "low-level abilities" are actually quite "high-level" to me. If fst could provide these abilities in a fast and memory-efficient way, I think that would be amazing! To begin with, my expectations are just:

How to access data more efficiently from an fst file? How to subset data more flexibly (by group? filter? slice? select? [I think I've handled this part in some way])?

I did make a function named filter_fst, but that might not be fast. I think fst could help to facilitate the access part very well. And about the computation part, if that can really be brought to us, that is a brand new revolution! I think that will open a new era to do computation out-of-memory, especially for some tough tasks.
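One way a downstream filter can limit memory use is to read the file in row chunks and keep only matching rows. This is a rough sketch under stated assumptions, not how filter_fst is implemented: `filter_fst_chunked`, `keep` and `chunk_size` are hypothetical names, and the file is assumed to have at least one row.

```r
library(fst)

# Hypothetical helper: filter an fst file chunk by chunk, so that only one
# chunk of rows (plus the rows kept so far) is held in memory at a time.
# `keep` is a function taking a data frame and returning a logical vector.
# Assumes the file has at least one row.
filter_fst_chunked <- function(path, keep, chunk_size = 1e5) {
  n <- metadata_fst(path)$nrOfRows
  starts <- seq(1, n, by = chunk_size)
  pieces <- lapply(starts, function(from) {
    to <- min(from + chunk_size - 1, n)
    chunk <- read_fst(path, from = from, to = to)
    chunk[keep(chunk), , drop = FALSE]
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL  # discard per-chunk row names
  out
}

# Usage on a small illustrative file
path <- tempfile(fileext = ".fst")
write_fst(data.frame(X = 1:10, Y = letters[1:10], stringsAsFactors = FALSE), path)
res <- filter_fst_chunked(path, function(d) d$X > 7, chunk_size = 4)
```

Every row is still read from disk once, so this trades speed for a bounded memory footprint; the row-filter support proposed earlier in the thread would avoid that cost inside fst itself.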

BTW, a small problem: I am trying to get zero rows from an fst table but failed. With a data.frame or data.table, you can use DT[0,] to get the column names and classes, which facilitates selection. Maybe fst table could do that too? Currently I use ft[1,][0,] for this; it works, but it is a little verbose. And if there are lots of columns, this may take some time. Is it possible to make ft[0,] work?
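Until ft[0,] is supported, the workaround described above can be wrapped in a small helper. This is a sketch only: `zero_rows_fst` is a hypothetical name, not part of fst, and it assumes the file has at least one row.

```r
library(fst)

# Hypothetical helper: return a zero-row data frame carrying the file's
# column names and classes, at the cost of reading a single row.
# Assumes the file has at least one row.
zero_rows_fst <- function(path) {
  read_fst(path, from = 1, to = 1)[0, , drop = FALSE]
}

# Usage on a small illustrative file
path <- tempfile(fileext = ".fst")
write_fst(data.frame(X = 1:3, Y = c("a", "b", "c"), stringsAsFactors = FALSE), path)
proto <- zero_rows_fst(path)
str(proto)
```

It still touches every column once, so on very wide files it shares the cost mentioned above.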

Thanks!

MarcusKlik commented 4 years ago

Hi @hope-data-science, you're right, ft[0, ] should definitely have an output identical to DT[0, ], using the example code above:

# identical
x[1, ]
#>   X Y
#> 1 1 2
fst_table[1, ]
#>   X Y
#> 1 1 2

# not identical
x[0, ]
#> [1] X Y
#> <0 rows> (or 0-length row.names)
fst_table[0, ]
#> Error in read_fst(meta_info$path, from = min_row, to = max_row): Parameter 'from' should have a numerical value equal or larger than 1.

Thanks for pointing that out, I'll schedule a fix for the next release!

MarcusKlik commented 4 years ago

Added as a separate issue.

hope-data-science commented 4 years ago

I've designed a new tool to work with fst, which is considered to be more memory efficient. Link: https://hope-data-science.github.io/tidyft/articles/Introduction.html

fstpackage / fst

fst_table as a serious class #236