Closed martinblostein closed 6 years ago
Hi @martinblostein, thanks a lot for your pull request!
Nice work, so the variables in the expression are categorized as known columns in the fst
file or as other variables. And further processing is done according to the specific category. That looks great!
I'm going to spent some time on your pull request to study it in more detail (as it deserves) so please be patient a little while longer until it's merged!
Nice work, so the variables in the expression are categorized as known columns in the fst file or as other variables. And further processing is done according to the specific category. That looks great!
That is the idea, but I didn't make that classification explicit yet. I think moving forward we definitely should! You may notice that the current code gets confused because when a new expression is introduced that is in the disk table, but not in the current table state, it thinks we want to refer to the on-disk column. Making this categorization explicit will fix that.
Hi @martinblostein, I've merged your pull request into the dev
branch with some minor tweaks, nice work!
At first, my idea was to completely exclude any expression parsing and substitutions from the table_proxy
object. But as you mention in the code, the parsing mechanism for a dplyr
interface will be very similar, so why not share that code among interfaces? The only rule would be that columns are present in the expression
object as unevaluated symbols.
Expressions are simplified to be in terms of on-disk columns and unknown symbols, but are not handled beyond that stage. Columns can be selected, rearranged, renamed and duplicated.
Example of functionality:
The main idea of this PR is the introduction of the
colexps
field in the table proxy object. This object is used to track the current column-state of the table. It is a list of expressions, with symbols of either on-disk columns or extra in-memory data.To go with the
colexps
field, I added atable_proxy_transform
method (terminology roughly based on dplyr). This method is used to updatecolexps
by substituting the exsting column expressions into the new ones. This method is extremely flexible (probably to a fault) and agnostic to the interface implementation.Note: A lot of the code in
table_proxy_methods.R
is just hacked together so that the tables print nicely. Before working on this further I think we should consider unifying or refactoringtable_proxy_read_full
andtable_proxy_read_range
to avoid code duplication. I also wonder if the range method is even necessary; reading a range of a proxy table is the same as reading a "full" proxy table with a different slice_map.