Open MarcusKlik opened 6 years ago
Some sample code:
# some test data
x <- data.frame(X = 1:100, Y = LETTERS[1 + (1:100) %% 26])
fst::write_fst(x, "1.fst")
# creates an instance of a data_table_interface class and sets
# a _remote_table_fst_ object as _remote table_
ft <- fsttable::fst_table("1.fst")
# just print the whole table
print(ft)
#> <fst file>
#> 100 rows, 2 columns
#>
#> X Y
#> <int> <fact>
#> 1 1 B
#> 2 2 C
#> 3 3 D
#> 4 4 E
#> 5 5 F
#> -- -- --
#> 96 96 S
#> 97 97 T
#> 98 98 U
#> 99 99 V
#> 100 100 W
Let me try and sum up to check my understanding:
Yes, that would be the most flexible setup I think. At some point in the future, we could have the following packages for a complete separation of concerns:
A package remote_table (or table_proxy or whatever :-)): This package would contain the table_proxy class and all the functionality needed to have a smart proxy for some table. It would also contain the generic methods for a remote table (but not an actual implementation).
A package that contains the implementation of the generic remote_table functions for the fst package. Perhaps that package would be called fst_remote. That package imports the remote_table package, because that's where it gets the remote table generics from. This is the only package where we read or write actual data to a fst file. The other packages are completely agnostic to the fst format.
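The split between generics and a backend implementation could look roughly like the sketch below. The generic names rtable_nrow(), rtable_colnames() and rtable_read() are made up for illustration and are not the actual fsttable API. The in-memory backend is only a stand-in so the example runs without any file; a real fst backend would wrap a file path and do its reading with fst::read_fst().

```r
# Generics that a hypothetical 'remote_table' package might export.
# All three names are illustrative assumptions.
rtable_nrow     <- function(x) UseMethod("rtable_nrow")
rtable_colnames <- function(x) UseMethod("rtable_colnames")
rtable_read     <- function(x, from, to, columns) UseMethod("rtable_read")

# Toy in-memory backend standing in for the fst backend.
remote_table_memory <- function(df) {
  structure(list(data = df), class = "remote_table_memory")
}

# The backend package only has to supply methods for the generics.
rtable_nrow.remote_table_memory <- function(x) nrow(x$data)
rtable_colnames.remote_table_memory <- function(x) names(x$data)
rtable_read.remote_table_memory <- function(x, from, to, columns) {
  x$data[from:to, columns, drop = FALSE]
}

rt <- remote_table_memory(
  data.frame(X = 1:100, Y = LETTERS[1 + (1:100) %% 26])
)
rtable_nrow(rt)
rtable_read(rt, from = 1, to = 3, columns = "Y")
```

Code that only calls the generics never needs to know which backend it is talking to, which is the point of keeping the fst-specific code in one place.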
With these two packages, any interface that can control the table proxy class could serve as a front-end. It would be great if there were (at least) two interface packages for the remote_table package: data.table.remote and dplyr_remote. These packages just provide an interface to the table proxy class, nothing more. So all the computing-on-the-language stuff is in there (such as parsing of the i and j parameters in data.table.remote). For data.table.remote, we can't really borrow code from the data.table package, because in data.table the code actually computes something. In data.table.remote, the work is only delegated to the remote_table package.
Once those two packages are available, any backend that implements the remote table generics gets a data.table and dplyr interface with smart caching for free! For fst that means that there could be two packages: fsttable and fstplyr. These packages just link the appropriate interface package (data.table.remote or dplyr_remote or sql_remote) to the remote table package (fst_remote).
Of course, these packages could also contain the implementations for the remote table and interface at first, so that we only need a single package instead of three (so fsttable, fst_remote and data.table.remote in one package). But separating the interface would make it available to other backends as well, and that would open up many new possibilities (e.g. doing a right-join with a fsttable and a csvtable without actually loading the data).
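To make the "interface only delegates" idea concrete, here is a minimal sketch of a front-end `[` method that records i and j in a proxy without reading any data. Everything in it is an assumption for illustration: the closure-based backend, the collect() function and the way j is parsed (bare column names only, via all.vars()) are far simpler than real data.table semantics.

```r
# Hypothetical backend as a list of closures; it counts how often
# data is actually read so the laziness is visible.
memory_backend <- function(df) {
  reads <- 0L
  list(
    nrow       = function() nrow(df),
    colnames   = function() names(df),
    read       = function(rows, cols) {
      reads <<- reads + 1L
      df[rows, cols, drop = FALSE]
    },
    read_count = function() reads
  )
}

# Hypothetical table proxy: holds the backend and the pending selection.
table_proxy <- function(backend) {
  structure(
    list(backend = backend,
         rows = seq_len(backend$nrow()),
         cols = backend$colnames()),
    class = "table_proxy"
  )
}

data_table_interface <- function(proxy) {
  structure(list(proxy = proxy), class = "data_table_interface")
}

# The interface only updates the proxy state; it computes nothing.
`[.data_table_interface` <- function(x, i, j) {
  proxy <- x$proxy
  if (!missing(i)) proxy$rows <- proxy$rows[i]   # i as a plain row index
  if (!missing(j)) proxy$cols <- all.vars(substitute(j))  # j is never evaluated
  data_table_interface(proxy)
}

# Data is read from the backend only when explicitly collected.
collect <- function(x) {
  p <- x$proxy
  p$backend$read(p$rows, p$cols)
}

backend <- memory_backend(
  data.frame(X = 1:100, Y = LETTERS[1 + (1:100) %% 26])
)
ft  <- data_table_interface(table_proxy(backend))
sel <- ft[96:100, Y]      # no read has happened yet
result <- collect(sel)    # the single read happens here
```

Because the `[` method never touches the data itself, the same interface code would work unchanged for any backend that offers the same three operations.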
It's a lot to wrap your head around (but writing it down helps :-))
Thanks for your questions!
Hi @Yuri-M-Dias, thanks again for contacting me. As discussed, this issue explains the design of the fsttable package. I've also added a README to show some of the features. Please let me know if you have any questions or suggestions!
After some tweaking, the design is now as follows:
fst backend).
To connect everything together, method fsttable (defined here) calls the table_proxy() constructor with a specific implementation of a remote_table. In this case that is our remote_table_fst implementation, defined here.
With this design, the data.table interface is completely separated from the remote_table implementation. Other remote_tables (other backends) could easily be added when they implement the generic functions. The data_table_interface class and generic functions can be refactored into a separate package. That package could be used by other backend packages to define a data.table interface for that specific backend, which they could do by implementing a custom version of the remote_table generics.
(only relevant when we can do more than just printing the whole table :-))
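As a rough illustration of how another backend could plug into the same wiring, the sketch below builds a csv-backed remote table with base R only. The names csv_table(), remote_table_csv and the rtable_*() generics are hypothetical, and the table_proxy() here is a bare placeholder for the real proxy class, which does much more (caching, state tracking).

```r
# Hypothetical generics the interface package would define.
rtable_nrow <- function(x) UseMethod("rtable_nrow")
rtable_read <- function(x, from, to) UseMethod("rtable_read")

# Placeholder proxy constructor: accepts any remote table implementation.
table_proxy <- function(remote_table) {
  structure(list(remote_table = remote_table), class = "table_proxy")
}

# A csv backend: it only needs to implement the generics.
remote_table_csv <- function(path) {
  structure(list(path = path), class = "remote_table_csv")
}
rtable_nrow.remote_table_csv <- function(x) {
  length(readLines(x$path)) - 1L               # subtract the header line
}
rtable_read.remote_table_csv <- function(x, from, to) {
  header <- names(read.csv(x$path, nrows = 1))
  # skipping 'from' lines skips the header plus the first from - 1 rows
  read.csv(x$path, header = FALSE, skip = from,
           nrows = to - from + 1L, col.names = header)
}

# User-facing constructor, analogous in spirit to fst_table().
csv_table <- function(path) {
  structure(list(proxy = table_proxy(remote_table_csv(path))),
            class = "data_table_interface")
}

# Printing reads only the first few rows through the generics.
print.data_table_interface <- function(x, ...) {
  rt <- x$proxy$remote_table
  n  <- rtable_nrow(rt)
  cat(n, "rows\n")
  print(rtable_read(rt, 1, min(5, n)))
  invisible(x)
}

path <- tempfile(fileext = ".csv")
write.csv(data.frame(X = 1:100, Y = LETTERS[1 + (1:100) %% 26]),
          path, row.names = FALSE)
ct <- csv_table(path)
print(ct)
```

The print method and the proxy constructor contain no csv-specific code, so swapping in a different remote table class only requires a new constructor and new methods for the generics.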