fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format
GNU Affero General Public License v3.0

Three layer interface #17

Open MarcusKlik opened 6 years ago

MarcusKlik commented 6 years ago

After some tweaking, the design is now as follows:

To connect everything together, the fsttable method (defined here) calls the table_proxy() constructor with a specific implementation of a _remote table_. In this case that is our remote_table_fst implementation, defined here.

With this design, the data.table interface is completely separated from the _remote table_ implementation. Other remote tables (other backends) can easily be added, as long as they implement the generic functions. The data_table_interface class and the generic functions can be refactored into a separate package. Other backend packages could then use that package to define a data.table interface for their own backend by implementing their own versions of the _remote table_ generics.
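
To make the separation concrete, here is a minimal sketch in plain S3. All names in it (rtable_nrow, rtable_read, remote_table_memory, toy_table_proxy, preview) are hypothetical and chosen for illustration only; they are not taken from the fsttable source. The in-memory backend stands in for something like remote_table_fst.

```r
# Hypothetical generics that every remote table backend would implement.
# These names are illustrative, not the actual fsttable generics.
rtable_nrow <- function(rt) UseMethod("rtable_nrow")
rtable_column_names <- function(rt) UseMethod("rtable_column_names")
rtable_read <- function(rt, rows) UseMethod("rtable_read")

# A toy in-memory backend (assumption: a real backend would read from disk).
remote_table_memory <- function(df) {
  structure(list(data = df), class = "remote_table_memory")
}
rtable_nrow.remote_table_memory <- function(rt) nrow(rt$data)
rtable_column_names.remote_table_memory <- function(rt) names(rt$data)
rtable_read.remote_table_memory <- function(rt, rows) {
  rt$data[rows, , drop = FALSE]
}

# The proxy holds a remote table and only talks to it through the generics.
toy_table_proxy <- function(remote_table) {
  structure(list(remote_table = remote_table), class = "toy_table_proxy")
}

# Front-end helper: fetch the first and last n rows without knowing
# which backend sits behind the proxy.
preview <- function(tp, n = 2) {
  rt <- tp$remote_table
  idx <- seq_len(rtable_nrow(rt))
  rows <- unique(c(head(idx, n), tail(idx, n)))
  rtable_read(rt, rows)
}

tp <- toy_table_proxy(remote_table_memory(data.frame(X = 1:10, Y = letters[1:10])))
preview(tp)
```

Because preview() only calls the generics, swapping in another backend would not require touching the front-end code, which is the point of the three-layer split.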

(only relevant when we can do more than just printing the whole table :-))

MarcusKlik commented 6 years ago

Some sample code:

# some test data
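# stringsAsFactors = TRUE keeps Y a factor on R >= 4.0, matching the <fact> column below
x <- data.frame(X = 1:100, Y = LETTERS[1 + (1:100) %% 26], stringsAsFactors = TRUE)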
fst::write_fst(x, "1.fst")

# creates an instance of a data_table_interface class and sets
# a _remote_table_fst_ object as _remote table_
ft <- fsttable::fst_table("1.fst")

# just print the whole table
print(ft)
#> <fst file>
#> 100 rows, 2 columns
#> 
#>         X      Y
#>     <int> <fact>
#> 1       1      B
#> 2       2      C
#> 3       3      D
#> 4       4      E
#> 5       5      F
#> --     --     --
#> 96     96      S
#> 97     97      T
#> 98     98      U
#> 99     99      V
#> 100   100      W

martinblostein commented 6 years ago

Let me try and sum up to check my understanding:

MarcusKlik commented 6 years ago

Yes, that would be the most flexible setup I think. At some point in the future, we could have the following packages to get a complete separation of concerns:

With these two packages, any interface that can control the table proxy class could serve as a front-end. It would be great if there were (at least) two interface packages for the remote_table package:

Once those two packages are available, any backend that implements the remote table generics gets a data.table and dplyr interface with smart caching for free! For fst that means that there could be two packages:

Of course, these packages could also contain the implementations for the remote table and interface at first, so that we only need a single package instead of three (so fsttable, fst_remote and data.table.remote in one package). But separating the interface would make it available to other backends as well, and that would open up many new possibilities (e.g. doing a right-join with a fsttable and a csvtable without actually loading the data).
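
As a rough illustration of what another backend might look like, the sketch below implements two hypothetical generics for a csv file. The names rtable_column_names, rtable_read_range and remote_table_csv are assumptions made up for this example and do not exist in any package; the only real calls are base R's read.csv and write.csv. The backend stores just the file path and reads the requested row range on demand.

```r
# Hypothetical generics (illustrative names, not the fsttable API).
rtable_column_names <- function(rt) UseMethod("rtable_column_names")
rtable_read_range <- function(rt, from, to) UseMethod("rtable_read_range")

# csv backend: keeps only the path, no data is loaded at construction time.
remote_table_csv <- function(path) {
  structure(list(path = path), class = "remote_table_csv")
}

rtable_column_names.remote_table_csv <- function(rt) {
  # Reading a single row is enough to learn the column names.
  names(utils::read.csv(rt$path, nrows = 1))
}

rtable_read_range.remote_table_csv <- function(rt, from, to) {
  # skip = from drops the header line plus the (from - 1) rows before `from`.
  df <- utils::read.csv(rt$path, header = FALSE,
                        skip = from, nrows = to - from + 1)
  names(df) <- rtable_column_names(rt)
  df
}

# Test data similar to the fst example earlier in this thread.
path <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(X = 1:100, Y = rep(c("a", "b"), 50)),
                 path, row.names = FALSE)

csv_rt <- remote_table_csv(path)
last_rows <- rtable_read_range(csv_rt, 96, 100)
last_rows
```

A front-end written against the generics would work with this object the same way it works with an fst-backed one. A lazy join across backends would need more generics than shown here.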

It's a lot to wrap your head around (but writing it down helps :-))

Thanks for your questions!

MarcusKlik commented 5 years ago

Hi @Yuri-M-Dias, thanks again for contacting me. As discussed, this issue explains the design of the fsttable package. I've also added a README to show some of the features. Please let me know if you have any questions or suggestions!