Three layer interface - Githubissues

MarcusKlik commented 6 years ago

After some tweaking, the design is now as follows:

The data_table_interface class (defined here) acts as a wrapper around the controller object of class table_proxy (defined here).
The table_proxy contains code for smart caching and for calculating on (subsets of) the data.
The constructor for the table_proxy object needs a remote_table object as an argument, so it wraps an abstract remote_table object.
That remote_table object should implement the generic functions defined here. Together, these generic functions are the only connection to everything that has to be done on the backend (in our case an fst backend).

To connect everything together, the method fst_table (defined here) calls the table_proxy() constructor with a specific implementation of a remote_table. In this case that is our remote_table_fst implementation, defined here.

With this design, the data.table interface is completely separated from the remote_table implementation. Other remote_table implementations (other backends) could easily be added, as long as they implement the generic functions. The data_table_interface class and the generic functions can be refactored into a separate package. Other backend packages could then use that package to define a data.table interface for their specific backend by implementing a custom version of the remote_table generics.

(only relevant when we can do more than just printing the whole table :-))
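
To make the layering concrete, here is a minimal sketch in R using S3 classes. It is not the fsttable code: the generics rtable_nrow, rtable_colnames and rtable_read, the helper proxy_head and the in-memory backend remote_table_memory are all hypothetical names chosen for illustration, and the proxy does no caching here.

```r
# Bottom layer: generics that every backend must implement (hypothetical names).
rtable_nrow <- function(x) UseMethod("rtable_nrow")
rtable_colnames <- function(x) UseMethod("rtable_colnames")
rtable_read <- function(x, rows, cols) UseMethod("rtable_read")

# A toy in-memory backend standing in for remote_table_fst.
remote_table_memory <- function(df) {
  structure(list(data = df), class = c("remote_table_memory", "remote_table"))
}
rtable_nrow.remote_table_memory <- function(x) nrow(x$data)
rtable_colnames.remote_table_memory <- function(x) names(x$data)
rtable_read.remote_table_memory <- function(x, rows, cols) {
  x$data[rows, cols, drop = FALSE]
}

# Middle layer: the proxy only talks to the backend through the generics.
table_proxy <- function(remote_table) {
  stopifnot(inherits(remote_table, "remote_table"))
  structure(list(remote = remote_table), class = "table_proxy")
}
proxy_head <- function(proxy, n = 5L) {
  n <- min(n, rtable_nrow(proxy$remote))
  rtable_read(proxy$remote, seq_len(n), rtable_colnames(proxy$remote))
}

# Top layer: the user-facing class wraps the proxy and nothing else.
data_table_interface <- function(proxy) {
  structure(list(proxy = proxy), class = "data_table_interface")
}
print.data_table_interface <- function(x, ...) {
  print(proxy_head(x$proxy))
  invisible(x)
}

ft <- data_table_interface(table_proxy(remote_table_memory(
  data.frame(X = 1:100, Y = rep(c("a", "b"), 50))
)))
print(ft)
```

Note that print.data_table_interface never touches the data directly: swapping the backend only requires new methods for the three generics.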

MarcusKlik commented 6 years ago

Some sample code:

# some test data
x <- data.frame(X = 1:100, Y = LETTERS[1 + (1:100) %% 26])
fst::write_fst(x, "1.fst")

# creates an instance of a data_table_interface class and sets
# a _remote_table_fst_ object as _remote table_
ft <- fsttable::fst_table("1.fst")

# just print the whole table
print(ft)
#> <fst file>
#> 100 rows, 2 columns
#> 
#>         X      Y
#>     <int> <fact>
#> 1       1      B
#> 2       2      C
#> 3       3      D
#> 4       4      E
#> 5       5      F
#> --     --     --
#> 96     96      S
#> 97     97      T
#> 98     98      U
#> 99     99      V
#> 100   100      W

martinblostein commented 6 years ago

Let me try and sum up to check my understanding:

The top layer (<userinterface>_interface) determines the user interface.
The middle layer (table_proxy) is fixed for all possible implementations.
The bottom layer (remote_table_<fileformat>) determines the on-disk file format.

MarcusKlik commented 6 years ago

Yes, that would be the most flexible setup I think. At some time in the future, we could have the following packages to have a complete separation of concerns:

A package remote_table (or table_proxy or whatever :-)): This package would contain the table_proxy class and all the functionality needed to have a smart proxy for some table. It would also contain the generic methods for a remote table (but not an actual implementation).
A package that contains the implementation of the generic remote_table functions for the fst package. Perhaps that package would be called fst_remote. That package imports the remote_table package, because that's where it gets the remote table generics from. This is the only package where we read or write actual data to an fst file. The other packages are completely agnostic to the fst format.

With these two packages, any interface that can control the table proxy class could serve as a front-end. It would be great if there were (at least) two interface packages for the remote_table package:

data.table.remote and dplyr_remote. These packages just provide an interface to the table proxy class, nothing more. So all the computing-on-the-language stuff is in there (such as parsing of the i and j parameters in data.table.remote). For data.table.remote, we can't really borrow code from the data.table package, because in data.table the code actually computes something. In data.table.remote, the work is only delegated to the remote_table package.
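
As a rough illustration of what "only delegating" could look like, the sketch below captures the i argument with substitute() and hands the unevaluated expression to the proxy. The names table_proxy, proxy_filter and data_table_interface are hypothetical stand-ins here, not the actual package API, and the column check is deliberately simplistic (it would also reject variables taken from the calling environment).

```r
# Hypothetical proxy: stores the pending row filter as an unevaluated expression.
table_proxy <- function(colnames) {
  structure(list(colnames = colnames, filter = NULL), class = "table_proxy")
}

# Hypothetical delegation point: record the filter, compute nothing.
proxy_filter <- function(proxy, expr) {
  unknown <- setdiff(all.vars(expr), proxy$colnames)
  if (length(unknown) > 0) {
    stop("unknown column(s): ", paste(unknown, collapse = ", "))
  }
  proxy$filter <- expr
  proxy
}

data_table_interface <- function(proxy) {
  structure(list(proxy = proxy), class = "data_table_interface")
}

# The `[` method captures `i` with substitute() instead of evaluating it.
`[.data_table_interface` <- function(x, i, j) {
  if (!missing(i)) {
    x$proxy <- proxy_filter(x$proxy, substitute(i))
  }
  x
}

ft <- data_table_interface(table_proxy(c("X", "Y")))
ft2 <- ft[X > 10 & Y == "B"]

# No data has been read; the proxy just holds the expression.
deparse(ft2$proxy$filter)
```

The interface layer ends up as a thin translator from data.table syntax to proxy calls; when and how the filter is actually applied is left to the proxy and the backend.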

Once those two packages are available, any backend that implements the remote table generics gets a data.table and dplyr interface with smart caching for free! For fst that means that there could be two packages:

fsttable and fstplyr. These packages just link the appropriate interface package (data.table.remote or dplyr_remote or sql_remote) to the remote table package (fst_remote).

Of course, these packages could also contain the implementations for the remote table and interface at first, so that we only need a single package instead of three (so fsttable, fst_remote and data.table.remote in one package). But separating the interface would make it available to other backends as well, and that would open up many new possibilities (e.g. doing a right-join between a fsttable and a csvtable without actually loading the data).
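
The "other backends" idea can be sketched as follows: two toy backends implement the same generic, and the backend-agnostic code works with either. All names (rtable_read, remote_table_memory, remote_table_csv, first_rows) are hypothetical, and the csv backend is naive in that it reads the whole file before subsetting; a real backend would read only the requested rows.

```r
# Hypothetical generic: the only thing a backend has to provide in this sketch.
rtable_read <- function(x, rows) UseMethod("rtable_read")

# Backend 1: an in-memory table (stand-in for an fst-backed remote table).
remote_table_memory <- function(df) {
  structure(list(data = df), class = c("remote_table_memory", "remote_table"))
}
rtable_read.remote_table_memory <- function(x, rows) {
  x$data[rows, , drop = FALSE]
}

# Backend 2: a csv file that is only opened when rows are requested.
remote_table_csv <- function(path) {
  structure(list(path = path), class = c("remote_table_csv", "remote_table"))
}
rtable_read.remote_table_csv <- function(x, rows) {
  df <- utils::read.csv(x$path)  # naive: reads the full file, then subsets
  df[rows, , drop = FALSE]
}

# Backend-agnostic code: works for anything implementing rtable_read().
first_rows <- function(remote, n = 3L) rtable_read(remote, seq_len(n))

df <- data.frame(X = 1:10, Y = letters[1:10])
path <- tempfile(fileext = ".csv")
utils::write.csv(df, path, row.names = FALSE)

a <- first_rows(remote_table_memory(df))
b <- first_rows(remote_table_csv(path))
```

Both calls go through the same first_rows() function, which is the property that would let a single interface package serve several backends.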

It's a lot to wrap your head around (but writing it down helps :-))

Thanks for your questions!

MarcusKlik commented 5 years ago

Hi @Yuri-M-Dias, thanks again for contacting me. As discussed, this issue explains the design of the fsttable package. I've also added a README to show some of the features. Please let me know if you have any questions or suggestions!

fstpackage / fsttable

Three layer interface #17