fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format
GNU Affero General Public License v3.0

Implement [[ for data.table interface using table_proxy #22

Closed martinblostein closed 6 years ago

martinblostein commented 6 years ago

Another small commit to restore some data.table functionality.

Disregarding recursive indexing, I believe this is the complete implementation of [[.datatableinterface, as the rows to return are determined by the table_proxy.
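To make the idea concrete, here is a minimal base R sketch of a `[[` method that only selects the column and leaves the row selection to a proxy object. All names in it (`make_proxy`, `make_interface`, the `toy_interface` class) are hypothetical and are not taken from fsttable; an in-memory data.frame stands in for the fst file.

```r
# Hypothetical proxy: holds a stand-in data source and the currently selected rows.
make_proxy <- function(data, rows = seq_len(nrow(data))) {
  list(data = data, rows = rows)
}

# Hypothetical interface object that wraps the proxy.
make_interface <- function(proxy) {
  structure(list(proxy = proxy), class = "toy_interface")
}

# `[[` only chooses the column; which rows come back is decided by the proxy.
`[[.toy_interface` <- function(x, j) {
  proxy <- unclass(x)$proxy
  if (length(j) != 1L) {
    stop("recursive indexing is not supported in this sketch")
  }
  proxy$data[[j]][proxy$rows]
}

# Usage: the proxy has rows 2 and 3 selected, so only those values are returned.
toy_data <- data.frame(Year = c(2015, 2016, 2016), Amount = c(10, 20, 30))
iface <- make_interface(make_proxy(toy_data, rows = 2:3))
amounts <- iface[["Amount"]]
```

Keeping the row logic inside the proxy means the method itself stays a one-liner, which matches the claim that nothing more is needed once recursive indexing is set aside.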

(Sorry for the pull request spam, I was in a hurry and kept pushing the wrong version.)

MarcusKlik commented 6 years ago

Hi @martinblostein, thanks a lot! Yes, that's the idea: the table_proxy keeps a complete picture of the reads necessary to fulfill a request from the interface. For example, when the data.table interface requests the first and last 5 rows for printing, the table_proxy has to determine the actual number of rows that have to be read from file, and in which order (not functional yet):

# reference to fst file and data.table interface, table_proxy and remote_table are created
ft <- fst_table("1.fst")

# interface requests update of proxy row- and column selection, new fst_table created
# the new fst_table contains a row-mask (the selection), 1 column reference and 1
# virtual column that holds the operator (>=) and the primitive (18) to be able to
# compute the contents of the whole column later. 
ft2 <- ft[Year == 2016, .(Amount, Adult = Age >= 18)]

# data.table interface requests the first and last 5 rows for printing; table_proxy determines
# that, because `>=` works per-element, only the first and last rows of Age are needed for
# this request, so the printing command will be very fast (no significant data required)
print(ft2)

The whole column needs to be read (and stored in a separate file) only when a method is used that does not work per-element, or when fsttable can't determine whether it does.
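The per-element argument can be illustrated with a small base R sketch. It is not fsttable code: `read_rows`, `virtual_column` and `evaluate_rows` are hypothetical helpers, and an in-memory data.frame stands in for the fst file so the number of rows touched can be counted.

```r
# Toy stand-in for an on-disk table; a real backend would read row ranges from file.
source_data <- data.frame(Age = c(12, 30, 17, 45, 18, 60, 5, 21, 16, 33, 70, 8))
rows_read <- 0L  # counts how many rows the "reader" has touched

# Hypothetical reader: returns only the requested rows of one column.
read_rows <- function(column, rows) {
  rows_read <<- rows_read + length(rows)
  source_data[[column]][rows]
}

# Virtual column: stores the source column, the operator and the primitive,
# not the computed values.
virtual_column <- function(column, op, primitive) {
  list(column = column, op = match.fun(op), primitive = primitive)
}

# Because `>=` works per-element, evaluating it on a subset of rows gives the
# same result as evaluating the whole column and subsetting afterwards.
evaluate_rows <- function(vcol, rows) {
  vcol$op(read_rows(vcol$column, rows), vcol$primitive)
}

adult <- virtual_column("Age", ">=", 18)
n <- nrow(source_data)
preview_rows <- c(1:5, (n - 4):n)  # first and last 5 rows, as for printing
preview <- evaluate_rows(adult, preview_rows)
```

Here only 10 of the 12 rows are read to build the preview. An operation such as `cumsum` would not allow this shortcut, since each result depends on all earlier rows.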

Thanks for the pull request!