Closed hadley closed 5 years ago
In the current design, keys aren't meant to have the properties of "primary keys". They are just a set of columns that are hidden from the subsetting and modifying operators but are still affected by them (for example, filtering rows also filters the keys). This enables the following code to restore the original data after (possibly harmful) modifications:
mtcars %>%
  key_by(vs) %>%
  mutate(vs = am) %>%
  filter(vs == 1) %>%
  restore_keys(vs)
This issue is probably a consequence of less-than-ideal naming. Initially the package was designed to use only ".id" as the key (a primary one), but I got a little carried away. "Keys" may be thought of as "foreign keys".
In that case I feel like I don't understand the motivation. When does this sort of problem crop up during a data analysis?
Not in actual data analysis, but rather in developing tools for it. Historically, keyholder emerged from my other package, ruler (for data frame validation using dplyr-style validation functions). The initial goal of keyholder was to invisibly track rows while applying some user-defined function (preferably created via a pipe using only dplyr functions). For example:
library(dplyr)
library(keyholder)
modify <- . %>%
  filter(vs == 1) %>%
  arrange(mpg)

mtcars %>%
  use_id() %>%
  modify() %>%
  pull_key(.id)
#> [1] 11 6 10 4 32 21 3 9 8 26 19 28 18 20
After implementing this feature, I decided to add functionality for using arbitrary data as "keys". The use case is ensuring that a user-defined function doesn't affect important columns of the input data frame. To do that, one stores them as keys, applies the user function, and restores the keys:
weird_modify <- . %>%
  transmute(
    vs = am + mpg,
    new_col = gear + 1
  ) %>%
  slice(1:4)

mtcars %>%
  key_by(starts_with("c")) %>%
  weird_modify() %>%
  restore_keys_all()
#> # A keyed object. Keys: cyl, carb
#> # A tibble: 4 x 4
#>      vs new_col   cyl  carb
#>   <dbl>   <dbl> <dbl> <dbl>
#> 1  22.0       5     6     4
#> 2  22.0       5     6     4
#> 3  23.8       5     4     1
#> 4  21.4       4     6     1
For actual data analysis I came up with a somewhat far-fetched case: if someone wants to modify all but a handful of columns with mutate_if(), here is the way:
library(dplyr)
library(rlang)
library(keyholder)
mtcars %>%
  key_by(vs, am, .exclude = TRUE) %>%
  mutate_if(rlang::is_integerish, ~ . * 2) %>%
  restore_keys_all(.remove = TRUE, .unkey = TRUE) %>%
  head()
#>    mpg cyl disp  hp drat    wt  qsec gear carb vs am
#> 1 21.0  12  160 220 3.90 2.620 16.46    8    8  0  1
#> 2 21.0  12  160 220 3.90 2.875 17.02    8    8  0  1
#> 3 22.8   8  108 186 3.85 2.320 18.61    8    2  1  1
#> 4 21.4  12  258 220 3.08 3.215 19.44    6    2  1  0
#> 5 18.7  16  360 350 3.15 3.440 17.02    6    4  0  0
#> 6 18.1  12  225 210 2.76 3.460 20.22    6    2  1  0
This changes the column order because columns vs and am were actually removed from the data frame and then restored. To keep them in place and preserve the column order, one can omit .exclude = TRUE. However, this means that the function ~ . * 2 is applied to those columns too (which is sometimes undesirable).
Oh I see. I wonder if mutate_if() and mutate_all() should have some default way of ignoring some columns. I think we already have similar logic built in so that you can't modify the grouping columns.
Well, yes, modifying grouping variables is prohibited. With this approach the same effect can be achieved for any set of columns without modifying the actual data. Also, restoring can be done with renaming to create "old" versions of columns (which is kind of cool :) ).
However, the main goal is to track data about rows while performing dplyr transformations. This needs to be done without modifying the actual data frame (using only attributes) so that it can be used inside user-defined functions.
Maybe you already do this (but it's not in the readme), but it would be nice to check that you have supplied valid keys (i.e. they are unique, and do not contain missing values)
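The check suggested here could look roughly like the sketch below. It uses base R only, and check_valid_keys() is a hypothetical helper, not a keyholder function. It takes a data frame of key columns and reports whether they contain missing values or duplicated rows.

```r
# Hypothetical helper (not part of keyholder): reports whether a data frame
# of key columns could act as a primary key, i.e. has no missing values and
# no duplicated rows.
check_valid_keys <- function(key_tbl) {
  stopifnot(is.data.frame(key_tbl))

  # TRUE if any key column contains at least one NA
  has_missing <- any(vapply(key_tbl, anyNA, logical(1)))
  # anyDuplicated() returns the index of the first duplicated row, or 0
  has_duplicates <- anyDuplicated(key_tbl) > 0

  list(
    has_missing = has_missing,
    has_duplicates = has_duplicates,
    valid = !has_missing && !has_duplicates
  )
}

# cyl and carb repeat across cars, so they are not unique keys
check_valid_keys(mtcars[c("cyl", "carb")])$valid
#> [1] FALSE

# A plain row index is unique and complete
check_valid_keys(data.frame(.id = seq_len(nrow(mtcars))))$valid
#> [1] TRUE
```

With keyholder, the input would presumably be the keys extracted from a keyed object. Since keys are not meant to behave as primary keys in the current design, such a check would make more sense as an opt-in step than as a default.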