fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format
GNU Affero General Public License v3.0
27 stars 4 forks source link

Implement data.table-like j argument, without evalutation #37

Closed martinblostein closed 6 years ago

martinblostein commented 6 years ago

Expressions are simplified to be in terms of on-disk columns and unknown symbols, but are not handled beyond that stage. Columns can be selected, rearranged, renamed and duplicated.

Example of functionality:

x <- data.table::data.table(a = 1:10, b = LETTERS[1:10])
tmp <- tempfile()
fst::write_fst(x, tmp)
ft <- fsttable::fst_table(tmp)
ft[, .(x = a, y = a + b)][1:5, .(z = y / 2, w = x)]

# <fst file>
# 5 rows, 2 columns
# 
#           z     w
#        <NA> <int>
# 1 (a + b)/2     1
# 2 (a + b)/2     2
# 3 (a + b)/2     3
# 4 (a + b)/2     4
# 5 (a + b)/2     5

The main idea of this PR is the introduction of the colexps field in the table proxy object. This object is used to track the current column-state of the table. It is a list of expressions, with symbols of either on-disk columns or extra in-memory data.

To go with the colexps field, I added a table_proxy_transform method (terminology roughly based on dplyr). This method is used to update colexps by substituting the exsting column expressions into the new ones. This method is extremely flexible (probably to a fault) and agnostic to the interface implementation.

Note: A lot of the code in table_proxy_methods.R is just hacked together so that the tables print nicely. Before working on this further I think we should consider unifying or refactoring table_proxy_read_full and table_proxy_read_range to avoid code duplication. I also wonder if the range method is even necessary; reading a range of a proxy table is the same as reading a "full" proxy table with a different slice_map.

MarcusKlik commented 6 years ago

Hi @martinblostein, thanks a lot for your pull request!

Nice work, so the variables in the expression are categorized as known columns in the fst file or as other variables. And further processing is done according to the specific category. That looks great!

I'm going to spent some time on your pull request to study it in more detail (as it deserves) so please be patient a little while longer until it's merged!

martinblostein commented 6 years ago

Nice work, so the variables in the expression are categorized as known columns in the fst file or as other variables. And further processing is done according to the specific category. That looks great!

That is the idea, but I didn't make that classification explicit yet. I think moving forward we definitely should! You may notice that the current code gets confused because when a new expression is introduced that is in the disk table, but not in the current table state, it thinks we want to refer to the on-disk column. Making this categorization explicit will fix that.

MarcusKlik commented 6 years ago

Hi @martinblostein, I've merged your pull request into the dev branch with some minor tweaks, nice work!

At first, my idea was to completely exclude any expression parsing and substitutions from the table_proxy object. But as you mention in the code, the parsing mechanism for a dplyr interface will be very similar, so why not share that code among interfaces? The only rule would be that columns are present in the expression object as unevaluated symbols.