echasnovski / keyholder

Store Data About Rows
https://echasnovski.github.io/keyholder/

Validity? #2

Closed hadley closed 5 years ago

hadley commented 6 years ago

Maybe you already do this (it's not in the readme), but it would be nice to check that the supplied keys are valid (i.e. they are unique and do not contain missing values).
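
A minimal sketch of such a check in base R, for illustration only. The helper `is_valid_key()` is hypothetical and not part of keyholder; it assumes the key columns are given as a character vector of column names.

```r
# Hypothetical helper (not a keyholder function): TRUE if the columns in
# `cols` contain no missing values and identify every row uniquely.
is_valid_key <- function(df, cols) {
  key_data <- df[cols]
  # No key column may contain NA
  no_missing <- !any(vapply(key_data, anyNA, logical(1)))
  # anyDuplicated() returns 0 when no row of `key_data` is repeated
  all_unique <- anyDuplicated(key_data) == 0
  no_missing && all_unique
}

is_valid_key(mtcars, c("vs", "am"))        # many rows share the same vs/am pair
is_valid_key(data.frame(id = 1:3), "id")   # unique and complete
```

The first call returns FALSE and the second TRUE, since `vs` and `am` take only a few distinct combinations in `mtcars`.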

echasnovski commented 6 years ago

In the current design, keys are not required to have the properties of "primary keys". They are just a set of columns that are hidden from subsetting and modifying operations but are still affected by them. This enables the following code to restore the original data after (possibly harmful) modifications:

mtcars %>%
  key_by(vs) %>%
  mutate(vs = am) %>%
  filter(vs == 1) %>%
  restore_keys(vs)

This issue is probably a consequence of less-than-ideal naming. Initially the package was designed to use only ".id" as the key (a primary one), but I got a little carried away. "Keys" may be thought of as "foreign keys".

hadley commented 6 years ago

In that case I feel like I don't understand the motivation. When does this sort of problem crop up during a data analysis?

echasnovski commented 6 years ago

Not in actual data analysis, but rather when developing tools for it. Historically, keyholder emerged from my other package, ruler (for data frame validation using dplyr-style validation functions). The initial goal of keyholder was to invisibly track rows while applying some user-defined function (preferably one created via a pipe using only dplyr functions). For example:

library(dplyr)
library(keyholder)

modify <- . %>%
  filter(vs == 1) %>%
  arrange(mpg)

mtcars %>%
  use_id() %>%
  modify() %>%
  pull_key(.id)
#>  [1] 11  6 10  4 32 21  3  9  8 26 19 28 18 20

After implementing this feature, I decided to add functionality for using arbitrary data as "keys". The use case is ensuring that a user-defined function doesn't affect important columns of the input data frame. To do that, one should store them as keys, apply the user function, and restore the keys:

weird_modify <- . %>%
  transmute(
    vs = am + mpg,
    new_col = gear + 1
  ) %>%
  slice(1:4)

mtcars %>%
  key_by(starts_with("c")) %>%
  weird_modify() %>%
  restore_keys_all()
#> # A keyed object. Keys: cyl, carb 
#> # A tibble: 4 x 4
#>      vs new_col   cyl  carb
#>   <dbl>   <dbl> <dbl> <dbl>
#> 1  22.0       5     6     4
#> 2  22.0       5     6     4
#> 3  23.8       5     4     1
#> 4  21.4       4     6     1

echasnovski commented 6 years ago

For actual data analysis I came up with a somewhat far-fetched case: if someone wants to modify all but a handful of columns with mutate_if(), here is a way to do it:

library(dplyr)
library(rlang)
library(keyholder)

mtcars %>%
  key_by(vs, am, .exclude = TRUE) %>%
  mutate_if(rlang::is_integerish, ~ . * 2) %>%
  restore_keys_all(.remove = TRUE, .unkey = TRUE) %>%
  head()
#>    mpg cyl disp  hp drat    wt  qsec gear carb vs am
#> 1 21.0  12  160 220 3.90 2.620 16.46    8    8  0  1
#> 2 21.0  12  160 220 3.90 2.875 17.02    8    8  0  1
#> 3 22.8   8  108 186 3.85 2.320 18.61    8    2  1  1
#> 4 21.4  12  258 220 3.08 3.215 19.44    6    2  1  0
#> 5 18.7  16  360 350 3.15 3.440 17.02    6    4  0  0
#> 6 18.1  12  225 210 2.76 3.460 20.22    6    2  1  0

This changes the column order because columns vs and am were actually removed from the data frame and then restored. To keep them in place and preserve the column order, one can omit .exclude = TRUE. However, this means that the function ~ . * 2 is applied to those columns too (which can sometimes be undesirable).

hadley commented 6 years ago

Oh I see. I wonder if mutate_if() and mutate_all() should have some default way of ignoring some columns. I think we already have similar logic built in so that you can't modify the grouping columns.

echasnovski commented 6 years ago

Well, yes, modifying grouping variables is prohibited. With this approach the same effect can be achieved for any set of columns without modifying the actual data. Also, restoring can be done with renaming to create "old" versions of columns (which is kind of cool :) ).

However, the main goal is to track data about rows while performing dplyr transformations. This needs to be done without modifying the actual data frame (using only attributes), so that it can be used inside user-defined functions.
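
To illustrate the attribute-based idea, here is a rough base R sketch. It is not how keyholder is implemented; `attach_id()` and `filter_rows()` are hypothetical helpers, and the attribute name "row_id" is made up for this example. Row ids live in an attribute, so the visible columns are untouched, and the subsetting step has to carry the attribute along explicitly.

```r
# Hypothetical helper: remember original row positions in an attribute
# instead of adding a column to the data frame.
attach_id <- function(df) {
  attr(df, "row_id") <- seq_len(nrow(df))
  df
}

# Hypothetical helper: subset rows and apply the same subsetting to the
# stored ids, re-attaching them to the result.
filter_rows <- function(df, keep) {
  ids <- attr(df, "row_id")
  out <- df[keep, , drop = FALSE]
  attr(out, "row_id") <- ids[keep]
  out
}

keyed <- attach_id(mtcars)
res <- filter_rows(keyed, keyed$vs == 1)

names(res)           # same columns as mtcars, no extra id column
attr(res, "row_id")  # original positions of the rows that survived
```

The point of the sketch is only that every row-affecting operation must update the attribute in step with the data, which is the bookkeeping keyholder takes care of for dplyr verbs.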