WinVector / rquery

Data Wrangling and Query Generating Operators for R. Distributed under choice of GPL-2 or GPL-3 license.
https://winvector.github.io/rquery/

How to do dummy/one-hot encoding in a pipeline? #15

Closed epspi closed 4 years ago

epspi commented 4 years ago

I have a pipeline in which a factor (actually an int) needs to be one-hot encoded with right censoring, i.e. every value of 5 or more shares a single indicator column. E.g.

0:  1 0 0 0 0 0
1:  0 1 0 0 0 0
2:  0 0 1 0 0 0
3:  0 0 0 1 0 0
4:  0 0 0 0 1 0
5+: 0 0 0 0 0 1

I see some possible manual avenues, but I am not sure whether there is a direct way.

  1. extend with conditionals to manually specify new variables for the desired levels. Results in wider table.
  2. right-join on a table column containing all the desired levels. Results in longer table.
  3. complete_design?

JohnMount commented 4 years ago

In general, re-coding data is what our vtreat package is for (https://github.com/WinVector/vtreat). But to do this directly in rquery, we can use the methods from the rquery many columns vignette (https://winvector.github.io/rquery/articles/rquery_many_columns.html).

library(wrapr)
library(rquery)
library(rqdatatable)

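# Example data: the integer column to encode.
d <- data.frame(x = 0:7)

# Named vector of expressions (names on the left of :=, expressions on the right):
# one indicator per exact level 0..4, plus one right-censored indicator for x >= 5.
codes <- paste0('x_eq_', 0:4) := paste0('as.numeric(x == ', 0:4, ')')
codes <- c(codes, 'x_ge_5' := 'as.numeric(x >= 5)')

# Build the operator pipeline: add all indicator columns in one extend step.
ops <- local_td(d) %.>%
  extend_se(., codes)

# Apply the pipeline to the in-memory data.
d %.>% ops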
#>    x x_eq_0 x_eq_1 x_eq_2 x_eq_3 x_eq_4 x_ge_5
#> 1: 0      1      0      0      0      0      0
#> 2: 1      0      1      0      0      0      0
#> 3: 2      0      0      1      0      0      0
#> 4: 3      0      0      0      1      0      0
#> 5: 4      0      0      0      0      1      0
#> 6: 5      0      0      0      0      0      1
#> 7: 6      0      0      0      0      0      1
#> 8: 7      0      0      0      0      0      1
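
The `codes` vector in the reply above is just a named character vector, so it can also be generated for any column name and censoring point. The following is a minimal sketch in base R only; `make_censored_codes` is a hypothetical helper name, not a function from rquery or wrapr, and it assumes a non-negative integer column with a cutoff of at least 1. The base-R check at the end only confirms that each row gets exactly one indicator; it does not run the rquery pipeline.

```r
# Hypothetical helper (not part of rquery or wrapr): build the named
# expression vector for a right-censored one-hot encoding of column `col`.
make_censored_codes <- function(col, cap) {
  stopifnot(cap >= 1)          # assumption: at least one exact level
  exact <- seq_len(cap) - 1L   # levels 0 .. cap - 1 get their own indicator
  codes <- setNames(
    paste0("as.numeric(", col, " == ", exact, ")"),
    paste0(col, "_eq_", exact)
  )
  # Everything at or above `cap` shares one censored indicator.
  c(codes, setNames(
    paste0("as.numeric(", col, " >= ", cap, ")"),
    paste0(col, "_ge_", cap)
  ))
}

codes <- make_censored_codes("x", 5)

# Base-R sanity check, independent of rquery: evaluate each expression
# against the example data and confirm every row has exactly one indicator set.
d <- data.frame(x = 0:7)
encoded <- sapply(codes, function(expr) eval(parse(text = expr), envir = d))
stopifnot(all(rowSums(encoded) == 1))
```

The resulting `codes` has the same names and expressions as the vector built with `:=` above, so it should be usable in the same way, e.g. `local_td(d) %.>% extend_se(., codes)`.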