Triamus / play

play repo for experiments (mainly with git)
1 stars 0 forks source link

star-like data wrangling #11

Open Triamus opened 6 years ago

Triamus commented 6 years ago

title: "data wrangling on star like data v 1.0" author: "me" classoption: landscape toc: true toc_depth: 2 header-includes:

\pagebreak

release notes

required libraries

library(pander)
library(dplyr)
library(readr)
library(tidyr)
library(xtable)
library(ggplot2)
library(scales)
library(grid)
library(lazyeval)
library(data.table)

global options

# output options
# large scipen penalty suppresses scientific notation in printed numbers
options("scipen"=100)
#opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
# xtable: non-floating tables, no timestamp / comment banner in LaTeX output
options(xtable.floating = FALSE)
options(xtable.timestamp = "")
options(xtable.comment = FALSE)
# tex / pandoc options for pdf creation
# NOTE(review): machine-specific Windows MiKTeX path appended to PATH;
# this will not work on other machines -- consider making it configurable
x <- Sys.getenv("PATH")
y <- paste(x, "E:\\Datenordner\\Downloads\\miktex\\miktex\\bin", sep=";")
Sys.setenv(PATH = y)
# always stringsAsFactors = F; if factors needed, declare them explicitly
options(stringsAsFactors = F)
# set work directory for read / write operations
path <- "E:/Datenordner/raw_data"
# if data is some place else, set root.dir option
# NOTE(review): opts_knit is a knitr object but knitr is not in the
# library() list above -- confirm knitr::opts_knit is attached when knitting
opts_knit$set(root.dir = path)
#setwd(paste(path, sep="/"))
#setwd(paste(path, sep="/"))

\pagebreak

overview

\pagebreak

create simulated data

data structure

knitr::include_graphics("./fdw_star.png")

country table e.g. country of domicile

# Country dimension table: key code, name, region and currency for the
# six simulated domiciles; plain character columns, no factors.
country_table <- data.frame(
  key_var          = c("DE", "FR", "UK", "US", "CN", "JP"),
  country_name     = c("Germany", "France", "United_Kingdom",
                       "United_States", "China", "Japan"),
  country_region   = c("Europe", "Europe", "Europe",
                       "Americas", "Asia", "Asia"),
  country_currency = c("EUR", "EUR", "GBP", "USD", "CNY", "JPY"),
  stringsAsFactors = FALSE
)

month table

# Month dimension table: month number, three-letter label and quarter.
# month.abb is the built-in c("Jan", ..., "Dec") constant; quarters run
# Q1..Q4, each covering three consecutive months.
month_table <- data.frame(
  key_var          = 1:12,
  month_desc       = month.abb,
  month_quarter    = paste0("Q", rep(1:4, each = 3)),
  stringsAsFactors = FALSE
)

ubr table

# number of org units per hierarchy level
n_ubr <- 1000

# Org-hierarchy dimension: each ubr_level* column is a permutation of its
# own disjoint n_ubr-wide integer band, so values never collide across
# levels.
# NOTE(review): ubr_level09 ends at 5*n_ubr but ubr_level08 starts at
# 6*n_ubr + 1 -- the (5*n_ubr, 6*n_ubr] band is skipped; confirm intended.
ubr_table <-
  data.frame(key_var = sample(x = n_ubr, size = n_ubr, rep = F),
             ubr_level12 = sample(x = (1*n_ubr + 1) : (2*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level11 = sample(x = (2*n_ubr + 1) : (3*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level10 = sample(x = (3*n_ubr + 1) : (4*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level09 = sample(x = (4*n_ubr + 1) : (5*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level08 = sample(x = (6*n_ubr + 1) : (7*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level07 = sample(x = (7*n_ubr + 1) : (8*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level06 = sample(x = (8*n_ubr + 1) : (9*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level05 = sample(x = (9*n_ubr + 1) : (10*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level04 = sample(x = (10*n_ubr + 1) : (11*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level03 = sample(x = (11*n_ubr + 1) : (12*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level02 = sample(x = (12*n_ubr + 1) : (13*n_ubr), 
                                  size = n_ubr, rep = F),
             ubr_level01 = sample(x = (13*n_ubr + 1) : (14*n_ubr), 
                                  size = n_ubr, rep = F),
             stringsAsFactors = F
  )

source system table

# Source-system dimension. Mixing numbers and strings in key_var coerces
# the whole column to character; the "NA" descriptions are literal
# strings, not missing values (is.na() is FALSE for them).
inst_id_table <-
  data.frame(key_var = c(100, 200, 300, 400, 500, 600, 700,
                     "9ERG", "ADJB2", "CCS", "CDS"),
             inst_id_desc = c("NA", "NA", "NA", "NA", "NA", "NA", "NA",
                              "who knows", "adjustment basel 2", 
                              "collateral shift", "credit default swap"),
             stringsAsFactors = F
  )

product type table

# Product dimension: 10 random (seed-dependent) numeric keys with
# descriptive labels.
product_type_table <-
  data.frame(key_var = sample(x = 10000:100000, size = 10, rep = F),
             product_type_desc = c("loan", "interest rate swap", 
                                   "interest rate forward",
                                   "repo", "option", "interest rate future",
                                   "asset-backed security", "cash", 
                                   "non-cash collateral", "other"),
             stringsAsFactors = F
  )

counterparty type table

# Counterparty-type dimension: keys A1..C3 (letter block x digit) mapped
# to descriptive labels; character columns, no factors.
counterparty_type_table <- data.frame(
  key_var = paste0(rep(c("A", "B", "C"), each = 3), rep(1:3, times = 3)),
  counterparty_type_desc = c(
    "Securitisation - Originator", "German SME", "Large Corporate",
    "Other SME", "Development Bank", "Commercial Bank",
    "Other FI", "Sovereign", "Central Counterparty"
  ),
  stringsAsFactors = FALSE
)

corep class table

# COREP exposure-class dimension: a single key column holding the 39
# regulatory class codes (no description column yet).
corep_table <- data.frame(
  key_var = c(
    "SOV", "INST", "CORP", "CORP_SL", "CORP_SL_RE", "CORP_SL_NRE",
    "CORP_SME", "CORP_SME_RE", "CORP_SME_NRE", "CORP_SME_SUPPFA",
    "CORP_NSME", "CORP_OTH", "CORP_OTH_RE", "CORP_OTH_NRE",
    "RTL", "RTL_RE", "RTL_RE_SME", "RTL_RE_NSME", "RTL_QR",
    "RTL_OTH", "RTL_OTH_SME", "RTL_OTH_NSME", "RTL_SME",
    "RTL_SME_SUPPFAC", "RTL_NSME", "EQU", "SEC", "OTH", "HR",
    "COVB", "ST", "CIU", "RGOV", "PSE", "MDB", "IORG",
    "SECM", "SECM_SME", "SECM_NSME"
  ),
  stringsAsFactors = FALSE
)

simulation function

create_obs <- function(n_obs, month, year, run = NULL) {
  # Simulate one month-end snapshot of transaction-level risk data.
  #
  # Args:
  #   n_obs: number of transactions (rows) to generate.
  #   month, year: reporting period stamped onto every row.
  #   run: unused; given a NULL default for backward compatibility --
  #        callers in this file omit it, which previously only worked
  #        because the argument was never evaluated (lazy evaluation).
  #
  # Returns: a data.frame with id, period columns, dimension keys
  # (country, ubrtrn, inst_id, product_type, corep, cpy_type) and risk
  # metrics (rwa, el, ec, pd, lgd). Character columns stay characters.
  #
  # Depends on the *_table dimension data.frames defined earlier in the
  # file. The sequence of RNG calls is unchanged from the original, so
  # results under a given set.seed() are identical.
  id <- sample(seq(from = 1, to = n_obs, by = 1), n_obs, replace = FALSE)
  # higher weights for the more important countries; sample() rescales
  # prob to sum to 1 internally
  country <- sample(country_table$key_var, n_obs, replace = TRUE,
                    prob = c(4, 1, 2, 3, 1, 1))
  ubrtrn <- sample(ubr_table$key_var, n_obs, replace = TRUE)
  inst_id <- sample(inst_id_table$key_var, n_obs, replace = TRUE)
  product_type <- sample(product_type_table$key_var, n_obs, replace = TRUE)
  cpy_type <- sample(counterparty_type_table$key_var, n_obs, replace = TRUE)
  corep <- sample(corep_table$key_var, n_obs, replace = TRUE)
  rwa <- runif(n = n_obs, min = 0, max = 100000000)
  el <- runif(n = n_obs, min = 0, max = 10000000)
  ec <- runif(n = n_obs, min = 0, max = 10000000)
  # down-weight high PDs to reflect a better portfolio: weights fall
  # linearly from 500 to 1 over pd 0-0.5, then stay flat at 1 for 0.5-1
  pd <- sample(seq(from = 0, to = 1, by = 0.001), n_obs, replace = TRUE,
               prob = c(seq(from = 500, to = 1, by = -1),
                        rep(x = 1, times = 501)))
  lgd <- sample(seq(from = 0, to = 1, by = 0.2), n_obs, replace = TRUE)
  data.frame(id = id,
             month = month,
             year = year,
             country = country,
             ubrtrn = ubrtrn,
             inst_id = inst_id,
             product_type = product_type,
             corep = corep,
             cpy_type = cpy_type,
             rwa = rwa,
             el = el,
             ec = ec,
             pd = pd,
             lgd = lgd,
             stringsAsFactors = FALSE)
}

create data

base data

# fixed seed so the simulated portfolio is reproducible
set.seed(123)
# NOTE: create_obs is called without its `run` argument; this works only
# because `run` is never evaluated inside the function body
data_201501 <- create_obs(n_obs=10000, month=1, year=2015)
dt_201501 <- data.table(data_201501)
str(data_201501)
str(dt_201501)
dt_201501
# take deep copy rather than reference
data_201502 <- copy(data_201501)
dt_201502 <- copy(dt_201501)
# define columns to change over time
cols <- c("rwa", "el", "ec", "pd", "lgd")
foo <- function(x, start, end, step, repl) {
  # Scale x by ONE random factor drawn from seq(start, end, step).
  #
  # Fix: index the grid via sample.int() instead of calling sample() on
  # the grid directly. sample(v, ...) on a length-1 numeric v samples
  # from 1:v (base-R sample() pitfall), so e.g. start = end = 2 would
  # have drawn from 1:2 rather than returning 2. For grids of length > 1
  # the drawn values are identical to the original under any seed.
  grid <- seq(from = start, to = end, by = step)
  x * grid[sample.int(length(grid), size = 1, replace = repl)]
}

# vectorize function by removing size argument
# vectorize function by removing size argument
foo_vec <- function(x, start, end, step, repl) {
  # Vectorised variant: draws length(grid) random factors (sample()'s
  # default size) and multiplies them onto x with recycling.
  #
  # Fix: index via sample.int() to avoid the base-R pitfall where
  # sample(v, ...) on a length-1 numeric v samples from 1:v. For grids
  # of length > 1 this draws exactly the same values as the original.
  grid <- seq(from = start, to = end, by = step)
  x * grid[sample.int(length(grid), size = length(grid), replace = repl)]
}

dplyr

# using standard evaluation version of mutate_each function to operate on vector
# of strings representing columns
# NOTE(review): mutate_each_() and funs() are long-deprecated dplyr APIs
# (superseded by mutate(across(...))); confirm the installed dplyr
# version still exports them
data_201502 <- 
  data_201502 %>%
  mutate_each_(funs(foo(., start = 0.9, end = 1.1, step = 0.01, repl = T)), cols)

# adjust month-end
data_201502 <- 
  data_201502 %>% 
  mutate(month = 2,
         year = 2015)

data.table

# same shock in data.table: := updates the cols by reference, one
# random factor vector recycled down the rows
dt_201502[, (cols) := foo_vec(.SD, start = 0.9, end = 1.1, 
                              step = 0.01, repl = T), .SDcols = cols]

# adjust month-end
dt_201502[, c("year", "month") := .(2015, 2)]

dplyr

# cap pd and lgd at 1 after the multiplicative shock
data_201502 <-
  data_201502 %>%
  mutate(pd = replace(pd, which(pd > 1), 1),
         lgd = replace(lgd, which(lgd > 1), 1))

data.table

# same capping, by reference
dt_201502[pd > 1, pd := 1]
dt_201502[lgd > 1, lgd := 1]

stress data

# stress scenario: start from deep copies of the base snapshots
data_str_201501 <- copy(data_201501)
dt_str_201501 <- copy(dt_201501)
data_str_201502 <- copy(data_201502)
dt_str_201502 <- copy(dt_201502)

# different seed to change values
set.seed(321)

# dplyr
# scale the risk metrics up by a random factor in [1, 2]
# (mutate_each_/funs are deprecated dplyr APIs)
data_str_201501 <- 
  data_str_201501 %>%
  mutate_each_(funs(foo(., start = 1, end = 2, step = 0.01, repl = T)), cols)

data_str_201502 <- 
  data_str_201502 %>%
  mutate_each_(funs(foo(., start = 1, end = 2, step = 0.01, repl = T)), cols)

# cap pd / lgd at 1 after the shock
data_str_201501 <-
  data_str_201501 %>%
  mutate(pd = replace(pd, which(pd > 1), 1),
         lgd = replace(lgd, which(lgd > 1), 1))

data_str_201502 <-
  data_str_201502 %>%
  mutate(pd = replace(pd, which(pd > 1), 1),
         lgd = replace(lgd, which(lgd > 1), 1))

# data.table
# same shocks applied by reference
dt_str_201501[, (cols) := foo_vec(.SD, start = 1, end = 2, 
                                  step = 0.01, repl = T), .SDcols = cols]

dt_str_201502[, (cols) := foo_vec(.SD, start = 1, end = 2, 
                                  step = 0.01, repl = T), .SDcols = cols]

# cap pd / lgd at 1, by reference
dt_str_201501[pd > 1, pd := 1]
dt_str_201501[lgd > 1, lgd := 1]
dt_str_201502[pd > 1, pd := 1]
dt_str_201502[lgd > 1, lgd := 1]

basic operations

slicing, dicing, wrangling

dplyr

select columns

# first two rows, all columns
slice(data_201501, 1:2)
data_201501[1:2, ]
head(data_201501, 2)
# last two rows, all columns
tail(data_201501, 2)
# select columns base r
data_201501[1:2, c("id", "rwa")]
data_201501[1:2, 1:5]

## select columns dplyr
head(select(data_201501, id, rwa), 2)
head(select(data_201501, id, rwa:ec), 2)
head(select(data_201501, -(rwa:ec)), 2)
# tidyselect helpers (matches() takes a regular expression)
head(select(data_201501, starts_with("r")), 2)
head(select(data_201501, ends_with("c")), 2)
head(select(data_201501, contains("product")), 2)
head(select(data_201501, matches(".r.")), 2)
head(select(data_201501, one_of("rwa", "ec", "el")), 2)
# select columns as variable
# NOTE: this reassigns `cols`, clobbering the earlier 5-column shock vector
cols <- c("id", "rwa")
col_nums <- match(cols, names(data_201501))
head(select(data_201501, col_nums), 2)
# using standard evaluation equivalent (select_ is deprecated in dplyr)
head(select_(data_201501, "id", "rwa"), 2)

rename

# rename rwa to new_col (returns a modified copy; data_201501 unchanged)
head(rename(data_201501, new_col = rwa), 2)

filter rows

# combined condition: pd above 0.3 and country outside DE/US
head(filter(data_201501, pd > 0.3 & !(country %in% c("DE", "US"))), 2)

# filter with variable using interp from lazyeval library
# determine metrics and filters dynamically e.g. as input in shiny app
# NOTE: lazyeval is already attached at the top of the file; this call
# is redundant but harmless
library(lazyeval)
metric_1 <- "pd"
metric_2 <- "rwa"
metric_3 <- "lgd"
metric_4 <- "ec"
# build an unevaluated filter expression with the metric names spliced in
criteria <- 
  interp(~ ((var1 <= 0.5 & var2 > 0) | 
                        (var3 <= 0.5 & var4 > 0)), 
                   var1 = as.name(metric_1),
                   var2 = as.name(metric_2),
                   var3 = as.name(metric_3),
                   var4 = as.name(metric_4))

# filter_ is the (deprecated) standard-evaluation variant of filter
head(filter_(data_201501, criteria), 2)

# mix variable and hard-coded filter with "~" operator
tag1 <- "A3"
head(filter_(data_201501, ~cpy_type == tag1 & corep == "OTH"), 2)

arrange / order

# sort ascending by country, then product_type
head(arrange(data_201501, country, product_type), 2)
# descending
head(arrange(data_201501, country, desc(rwa)), 2)

combining / chaining operations

# pipe: filter -> project -> sort -> first two rows
data_201501 %>%
  filter(country == "DE") %>%
  select(id, ec) %>%
  arrange(desc(ec)) %>%
  head(., 2)

distinct values

head(distinct(data_201501, id), 2)

adding / removing / adjusting columns

# mutate can reference columns created earlier in the same call
data_201501 %>%
  mutate(new_col = rwa - el,
         ec = ec * 2,
         new_col_2 = new_col / 2) %>%
  head(., 2)

# if you only want to keep the new variables, use transmute()
head(transmute(data_201501,
               new_col = rwa - el,
               new_col_2 = new_col / 2), 2)

summarise

# whole-table summary statistics for rwa
summarise(data_201501,
          rwa_min = min(rwa, na.rm = T),
          rwa_median = median(rwa, na.rm = T),
          rwa_mean = mean(rwa, na.rm = T),
          rwa_max = max(rwa, na.rm = T))

random sampling

# fixed row count vs. fraction of rows (with replacement)
head(sample_n(data_201501, 2))
head(sample_frac(data_201501, 0.005, replace = T), 2)

data.table

select columns

# first row (data.table recycles a single i as a row index)
dt_201501[1]
dt_201501[1, ]
dt_201501[1:2, ]
dt_201501$id[1:5]
# extracting one column does not work
# NOTE(review): behaviour is version-dependent -- older data.table
# returned the literal 1 here, newer versions return the first column
dt_201501[, 1]
# instead give column name
head(dt_201501[, country], 2)
# above subset returns a vector; Try DT[,.(country)] instead. .() is an
# alias for list() and ensures a data.table is returned.
dt_201501[1:2, .(country)]
dt_201501[1:2, list(country)]
mycol <- "country"
# returns a vector
head(dt_201501[[mycol]], 2)
# with = F returns table
dt_201501[1:2, mycol, with = FALSE]
# grab multiple columns
dt_201501[1:2, list(country, cpy_type)]
# penultimate row of DT using `.N`
dt_201501[.N-1]
# dimensions and names
colnames(dt_201501)
dim(dt_201501)
# select row 2 twice and row 3 for selected columns
dt_201501[c(2,2,3), .(id, country)]

rename

# setnames renames in place, hence the deep copy first
dt_temp <- copy(dt_201501)
setnames(dt_temp, c("rwa", "ec"), c("rwa_new", "ec_new"))
dt_temp[1:2]

filter

dt_201501[pd > 0.5 & lgd > 0.2][1:2,]
# use variable filter with criteria parsed as text
# NOTE(review): eval(parse(text = ...)) is fragile and unsafe for
# untrusted input; prefer building expressions programmatically
criteria <- "pd > 0.5 & lgd > 0.2"
dt_201501[eval(parse(text = criteria))][1:2,]
# use get function
mycol <- "rwa"
dt_201501[get(mycol) > 1000000][1:2,]

arrange / order

head(dt_201501[order(rwa), ], 2)
head(dt_201501[order(-rwa), ], 2)

combining / chaining operations

# chained [] calls: filter, then sort the result
head(dt_201501[inst_id == "ADJB2" & rwa > 13^7, ][order(rwa)], 2)

distinct values and duplicates

dt_with_key <- copy(dt_201501)
# use setkeyv to set more than one key
setkeyv(dt_with_key, c("country", "inst_id"))
key(dt_with_key)
head(dt_with_key, 3)
duplicated(dt_with_key, by=c("country", "inst_id"))[1:5]
# since key is already set, also works without explicitly setting it
duplicated(dt_with_key)[1:5]
# will not work on ad-hoc keys -> result is wrong
head(duplicated(dt_201501, by=c("country", "inst_id")), 2)
# for table with no key entire row will be evaluated as key
duplicated(dt_201501)[1:5]
# unique() on a keyed table de-duplicates by the key columns
dt_with_key_unique <- unique(dt_with_key)
dt_with_key_unique[1:5, .(country, inst_id)]
# return index of first duplicate if there is one, otherwise 0
# note by=key(NULL) sets any pre-set key to NULL
anyDuplicated(dt_with_key, by=key(NULL))
anyDuplicated(dt_with_key)
uniqueN(dt_with_key)

adding / removing / adjusting columns

# set country code with less than 2 characters to unknown
dt_201501[nchar(country) < 2, country := "Unknown"]
# set na values to zero
dt_201501[is.na(rwa), rwa := 0]
# add new columns via custom function by group
vars <- c("rwa", "ec", "el")
dt_temp <- copy(dt_201501)
dt_temp[, paste0(vars,"_","sum") := lapply(.SD, sum), 
        .SDcols = vars, by = country]
dt_temp[1:2]

# alternatively with pre-determined functions
funs <- c("min", "max", "mean", "sum") # <- define your function
for(i in funs){
  dt_temp[, paste0(vars, "_", i) := lapply(.SD, eval(i)), .SDcols = vars, 
          by = cpy_type] 
  }

# remove column
dt_temp[, el_sum := NULL]

summarise

dt_temp[inst_id == "ADJB2", list(rwa_min = min(rwa, na.rm = T), 
                                 rwa_avg = mean(rwa, na.rm = T), 
                                 rwa_max = max(rwa, na.rm = T),
                                 count = .N), 
        by = c("country")]

random sampling

grouped operations

general

dplyr

# summarise by group
data_201501 %>%
  group_by(country) %>%
  summarise(count = n(),
            rwa_min = min(rwa, na.rm = T), 
            rwa_avg = mean(rwa, na.rm = T), 
            rwa_max = max(rwa, na.rm = T)) %>%
  head(., 2)

data.table

# data.table equivalent: filter + grouped aggregate + first two groups
dt_temp[inst_id == "ADJB2", list(rwa_min = min(rwa, na.rm = T), 
                                 rwa_avg = mean(rwa, na.rm = T), 
                                 rwa_max = max(rwa, na.rm = T),
                                 count = .N), 
        by = c("country")][1:2]

merge / join / lookup

general

create base table

# dimension-key-only base table (no metric columns)
data_base_table <- 
  data_201501 %>%
  select(id, month, year, country, ubrtrn, inst_id, product_type,
         corep, cpy_type)

dt_base_table <- dt_201501[, .(id, month, year, country, ubrtrn, inst_id,
                               product_type, corep, cpy_type)]

dplyr

# left join -> see vignette for more
data_merged_table <-
  left_join(x = data_base_table,
            y = data_201501[, c("id", "rwa", "el", "ec", "pd", "lgd")],
            by = c("id" = "id"))

data.table

## left join -> see vignette for more
# X[Y] looks up X's rows using Y's key; j selects the output columns
setkey(dt_base_table, id)
setkey(dt_201501, id)
dt_merged_table <-
  dt_base_table[dt_201501,
                list(id, month, year, country, ubrtrn, inst_id, product_type,
                     corep, cpy_type, rwa, el, ec, pd, lgd)]
# check how to avoid having to state all column names of base table,
# maybe using setdiff

## not using pre-key setting but rather ad-hoc key setting
# clear keys
setkey(dt_base_table, NULL)
setkey(dt_201501, NULL)
# generic syntax: on= supplies the join columns ad hoc
dt_merged_table <- 
  dt_base_table[dt_201501,
                list(id, month, year, country, ubrtrn, inst_id, product_type,
                     corep, cpy_type, rwa, el, ec, pd, lgd),
                on = c(id = "id")]

# using data.table::merge function
dt_merged_table <- 
  merge(dt_base_table, dt_201501[, list(id, rwa, el, ec, pd, lgd)],
        by.x = "id", by.y = "id")

sort

general

create unsorted table

data_unsorted <- copy(data_201501)
dt_unsorted <- copy(dt_201501)

dplyr

data_sorted <- arrange(data_unsorted, desc(year), desc(month), country, inst_id)

data.table

# order() in i returns a reordered copy
dt_sorted <- dt_unsorted[order(-year, -month, country, inst_id)]
# use setorder for order by reference (in-place), without making any additional copies
# NOTE: setorder reorders dt_unsorted itself, so after this line
# dt_sorted and dt_unsorted are the same object
dt_sorted <- setorder(dt_unsorted, -year, -month, country, inst_id)

reshape

general

create long data

# three monthly snapshots stacked into one long table
set.seed(123)
dt_201601 <- data.table(create_obs(n_obs=10000, month=1, year=2016))
dt_201602 <- data.table(create_obs(n_obs=10000, month=2, year=2016))
dt_201603 <- data.table(create_obs(n_obs=10000, month=3, year=2016))
data <- rbind(dt_201601, dt_201602, dt_201603)

tidyr

# spread rwa i.e. long to wide
# NOTE(review): spread()/gather() are superseded by pivot_wider()/
# pivot_longer() in tidyr >= 1.0
temp <- spread(data = data[,.(id, year, month, rwa)], 
               key = month, 
               value = rwa)
# rename cols
names(temp) <- c("id", "year", "rwa_1", "rwa_2", "rwa_3")
str(temp)
# gather i.e. wide to long
temp <- gather(data = temp, key = month, value = rwa, rwa_1, rwa_2, rwa_3)
str(temp)

data.table

# long to wide
temp <- dcast(data = data[,.(id, year, month, rwa)],
              formula = id + year ~ month, 
              value.var = "rwa")
# rename cols
names(temp) <- c("id", "year", "rwa_1", "rwa_2", "rwa_3")
str(temp)

# wide to long
temp <- melt(data = temp, 
             id.vars = c("id", "year"),
             measure.vars = c("rwa_1", "rwa_2", "rwa_3"),
             variable.name = "month",
             value.name = "rwa")
str(temp)

import / export

general

export with base r

# Write to a file, suppress row names
write.csv(data_201501, "./data.csv", row.names=FALSE)

# Same, except that instead of "NA", output dot as in sas
write.csv(data_201501, "./data.csv", row.names=FALSE, na=".")

# Use tabs, suppress row names
write.table(data_201501, "./data.csv", sep="\t", row.names=FALSE)

# use readr write_delim which is about twice as fast as write.csv, 
# and never writes row names.
# will not introduce quotes around text as write.table
# NOTE(review): write_delim's `path` argument was renamed to `file` in
# readr 1.4 -- confirm the installed readr still accepts `path`
write_delim(x=data_201501, path="./data.csv", delim="\t", append=FALSE)

readr (hadley.wickham)

# tab-separated read-back with a header row
data_import <- read_delim("./data.csv", delim="\t", col_names=TRUE)

data.table

# use fread auto setting to detect all parameters automatically or set explicitly
dt_import <- fread("./data.csv", sep="\t", header="auto")

miscellaneous

apply multiple pre-defined named functions to selected columns

general

dplyr

data.table

# http://stackoverflow.com/questions/29620783/data-table-in-r-apply-multiple-functions-to-multiple-columns

# returns a named list so each column yields column.mean / column.median
summary_functions <- function(x) list(mean = mean(x, na.rm = T), 
                                      median = median(x, na.rm = T))

# this gives a vector output
dt_201501[, unlist(lapply(.SD, summary_functions)), 
          .SDcols = c("rwa", "ec", "el")]

# this gives a data.table output
dt_201501[, as.list(unlist(lapply(.SD, summary_functions))), 
          .SDcols = c("rwa", "ec", "el")]

# simplify call
# (redefines summary_functions to return a named vector instead of a list)
summary_functions <- function(x) c(mean = mean(x, na.rm = T), 
                                   median = median(x, na.rm = T))
# this gives a vector output
dt_201501[, sapply(.SD, summary_functions), 
          .SDcols = c("rwa", "ec", "el")]

a case study with data.table

# split-apply-combine illustrations for the case study
knitr::include_graphics("./split-apply-combine.png")
knitr::include_graphics("./split-apply-combine-flat.png")

transform meta tables to data.table with keys

# keyed data.table versions of the dimension tables, all keyed on key_var
country_dt <- data.table(country_table, key = c("key_var"))
month_dt <- data.table(month_table, key = c("key_var"))
ubr_dt <- data.table(ubr_table, key = c("key_var"))
inst_id_dt <- data.table(inst_id_table, key = c("key_var"))
product_type_dt <- data.table(product_type_table, key = c("key_var"))
counterparty_type_dt <- data.table(counterparty_type_table, key = c("key_var"))
corep_dt <- data.table(corep_table, key = c("key_var"))

join base and stress and calculate delta on the fly

# key all snapshots on id for the joins below
setkey(dt_201501, id)
setkey(dt_str_201501, id)
setkey(dt_201502, id)
setkey(dt_str_201502, id)
# shock a random 1% of rows more heavily
n <- dim(dt_str_201501)[1]
sample_rows <- sample(x = n, size = n/100, rep = F)
# NOTE: redefines `cols` again, now to the three amount columns
cols <- c("rwa", "el", "ec")

# lapply(.SD, foo, ...) draws one random factor per column
dt_str_201501[sample_rows, (cols) := lapply(.SD, foo, start=2, end=4, 
                                            step=1, repl=T),
              .SDcols = cols]

dt_str_201502[sample_rows, (cols) := lapply(.SD, foo, start=2, end=4, 
                                            step=1, repl=T),
              .SDcols = cols]
# using direct syntax (interpreted as left join, i.e. X[Y] is a join, looking up X's 
# rows using Y (or Y's key if it has one) as an index.)
# using direct syntax will modify original object by reference. if a copy is needed,
# copy base table first and perform join afterwards (memory-inefficient)
# for unkeyed tables use parameter on = "id" as ad-hoc key
# Update join: look up the stress rows by id and add stress-minus-base
# deltas onto the copy by reference (`i.` prefixes columns of the
# joining stress table).
# Fix: removed `with = F` -- combining with=FALSE and `:=` is
# deprecated/removed in data.table, and it is unnecessary when the LHS
# of `:=` is a character vector of column names.
dt_join_201501 <- copy(dt_201501)
dt_join_201501[dt_str_201501,
               c("rwa_delta", "ec_delta", "el_delta", "pd_delta",
                 "lgd_delta") :=
                 .(i.rwa - rwa, i.ec - ec, i.el - el, i.pd - pd,
                   i.lgd - lgd),
               on = "id"]

# alternative syntax which some may find easier to debug
dt_join_201501 <- copy(dt_201501)
dt_join_201501[dt_str_201501, `:=` (rwa_delta = i.rwa - rwa,
                                    ec_delta = i.ec - ec,
                                    el_delta = i.el - el,
                                    pd_delta = i.pd - pd,
                                    lgd_delta = i.lgd - lgd),
               on = "id"]

# regenerate the base snapshot (dt_201501 was modified in place above)
dt_201501 <- data.table(create_obs(n_obs=10000, month=1, year=2015))
setkey(dt_201501, id)

# using merge function to create new table, use all=T if full join is wanted.
# cannot perform delta computation and join in one step?
dt_join_201501 <- 
  merge(dt_201501, dt_str_201501[, .(id, rwa, el, ec, pd, lgd)], 
        suffixes=c("_base", "_stress"))

# absolute deltas, stress minus base
dt_join_201501[, `:=` (rwa_delta = rwa_stress - rwa_base,
                       ec_delta = ec_stress - ec_base,
                       el_delta = el_stress - el_base,
                       pd_delta = pd_stress - pd_base,
                       lgd_delta = lgd_stress - lgd_base)]

# remove unnecessary cols
dt_join_201501[, c("rwa_stress", "ec_stress", 
                   "el_stress", "pd_stress", "lgd_stress") := NULL]

# compute relative delta
dt_join_201501[, `:=` (rwa_delta_rel = rwa_delta / rwa_base,
                       ec_delta_rel = ec_delta / ec_base,
                       el_delta_rel = el_delta / el_base)]

exploratory analysis

impact by various aggregations for different vars

# aggregate base RWA and delta statistics by country
dt_join_201501[, .(rwa_base = sum(rwa_base, na.rm = T),
                   rwa_delta = sum(rwa_delta, na.rm = T),
                   rwa_delta_min = min(rwa_delta, na.rm = T), 
                   rwa_delta_mean = mean(rwa_delta, na.rm = T),
                   rwa_delta_median = median(rwa_delta, na.rm = T),
                   rwa_delta_max = max(rwa_delta, na.rm = T),
                   count = .N), by = c("country")]

# rescale amounts to millions for plotting
scale <- 1*10^6

ggplot(dt_join_201501, aes(country, rwa_delta/scale)) +
  geom_boxplot(fill = "white", colour = "darkblue", 
               outlier.colour = "red", outlier.shape = 1) +
    scale_y_continuous(labels=comma) +
    labs(title="RWA Delta (in mn)", x="Country", y="RWA Delta (in mn) \n")

# do the same with qplot for inst_id and ec
# NOTE(review): stat_summary's fun.y argument was renamed to fun in
# ggplot2 3.3 -- confirm the installed version still accepts fun.y
qplot(data = dt_join_201501, x = inst_id, y = ec_delta/scale, fill = inst_id, 
  geom = "boxplot", xlab = "\n inst_id", ylab = "EC Delta (in mn) \n", 
  main="EC Delta (in mn) \n") + 
  theme(legend.position="bottom") +
  # adding the mean with black circle
  geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) + 
  geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))

# distribution of base PDs
ggplot(dt_join_201501, aes(x = pd_base)) + 
  geom_histogram(binwidth=.01, colour="darkblue", fill="white")

# NOTE(review): ..density.. is superseded by after_stat(density) in
# newer ggplot2 -- confirm version compatibility
qplot(data = dt_join_201501, x = el_delta/scale, geom = "histogram", y = ..density.., 
      binwidth = 0.5, colour = I("white"), fill = I("orange"), 
      xlab = "\n EL Delta (in mn)", 
      ylab = "Density", main = "Histogram of EL Delta (in mn) \n")

# per-country PD densities; low alpha keeps overlaps readable
ggplot(dt_join_201501, aes(pd_base, fill = country)) +
  geom_density(alpha = 0.1)

# base vs delta scatter, both axes in millions
dt_join_201501 %>%
  ggplot(aes(x=rwa_base/1000000, y=rwa_delta/1000000)) +
  geom_point(shape=20, col="darkblue") +
  scale_y_continuous(labels=comma) +
  labs(title="RWA Base vs. RWA Stress Delta (in mn)", 
       x="Base (in mn)", y="Stress Delta (in mn)")

ggplot(dt_join_201501, aes(rwa_delta, fill = country)) +
  geom_density(alpha = 0.2)

pivot-like / olap-like queries

aggregation by meta table attributes

# Aggregate RWA figures by product description pulled in from the keyed
# product dimension. The update join adds product_type_desc to `test`
# by reference (`i.` = columns of product_type_dt), then the chained []
# aggregates by the new column.
# Fix: removed `with = F` -- combining with=FALSE and `:=` is
# deprecated/removed in data.table, and it is unnecessary when the LHS
# is a character vector.
test <- copy(dt_join_201501)
setkey(test, product_type)

test[product_type_dt, c("product_type_desc") :=
              .(i.product_type_desc)
     ][, .(rwa_base = sum(rwa_base, na.rm = TRUE),
           rwa_delta = sum(rwa_delta, na.rm = TRUE),
           rwa_delta_min = min(rwa_delta, na.rm = TRUE),
           rwa_delta_mean = mean(rwa_delta, na.rm = TRUE),
           rwa_delta_median = median(rwa_delta, na.rm = TRUE),
           rwa_delta_max = max(rwa_delta, na.rm = TRUE),
           count = .N), by = c("product_type_desc")]

# do the same with ad-hoc-key
# Same aggregation using an ad-hoc key (on=) instead of setkey().
# Fix: removed the deprecated `with = F` alongside `:=` (unnecessary
# with a character-vector LHS, and rejected by current data.table).
test <- copy(dt_join_201501)
setkey(test, NULL)

test[product_type_dt, c("product_type_desc") :=
              .(i.product_type_desc),
     on = c(product_type = "key_var")
     ][, .(rwa_base = sum(rwa_base, na.rm = TRUE),
           rwa_delta = sum(rwa_delta, na.rm = TRUE),
           rwa_delta_min = min(rwa_delta, na.rm = TRUE),
           rwa_delta_mean = mean(rwa_delta, na.rm = TRUE),
           rwa_delta_median = median(rwa_delta, na.rm = TRUE),
           rwa_delta_max = max(rwa_delta, na.rm = TRUE),
           count = .N), by = c("product_type_desc")]

building an ad-hoc aggregation function

# http://stackoverflow.com/questions/37706385/r-data-table-function-wrapper-around-ad-hoc-join-with-aggregation-in-a-chain
# all efforts to avoid magrittr chain failed. removing it, fails to return
# result table, assigning result and then return fails to delete aggregation var
# after return
# Ad-hoc join + aggregation wrapper.
#   x        -- fact data.table, MODIFIED IN PLACE: the agg_by column is
#               added by reference, printed aggregation uses it, then it
#               is removed again by reference
#   meta_tbl -- dimension data.table supplying the agg_by column
#   x_key / meta_key -- join columns on x and meta_tbl respectively
#   agg_by   -- name (string) of the description column to aggregate by
# Prints the aggregate; returns x invisibly (result of the last :=).
agg_foo <- function(x, meta_tbl, x_key, meta_key, agg_by) { 
  x[meta_tbl, 
      (agg_by) := get(agg_by),
      on=setNames(meta_key, x_key)][, .(rwa_base = sum(rwa_base, na.rm = T),
                                        rwa_delta = sum(rwa_delta, na.rm = T),
                                        count = .N), by = c(agg_by)] %>%
    print(.)

  x[, (agg_by) := .(NULL)]
  }

test2 <- copy(test)
agg_foo(x=test2, meta_tbl=product_type_dt, 
        x_key="product_type", meta_key="key_var", 
        agg_by="product_type_desc")

# try without chain - check performance difference if any
# Same contract as the chained version, but the aggregate is assigned to
# a temporary and returned explicitly instead of printed. x is still
# modified in place (agg_by column added, then removed by reference).
agg_foo <- function(x, meta_tbl, x_key, meta_key, agg_by) { 

  x[meta_tbl, (agg_by) := get(agg_by), on=setNames(meta_key, x_key)]

  temp <-
    x[, .(rwa_base = sum(rwa_base, na.rm = T),
          rwa_delta = sum(rwa_delta, na.rm = T),
          count = .N), by = c(agg_by)]

  x[, (agg_by) := .(NULL)]

  return(temp)
  }

test2 <- copy(test)
temp <-
  agg_foo(x=test2, meta_tbl=product_type_dt, 
          x_key="product_type", meta_key="key_var", 
          agg_by="product_type_desc")
temp

visualization with ggplot

general

data

# Re-shock a random subset of rows, rebuild the base-vs-stress join and
# recompute absolute deltas.
dt_str_201501[sample_rows, (cols) := foo_vec(.SD, start = -1, end = 5,
                                             step = 1, repl = TRUE),
              .SDcols = cols]

dt_str_201502[sample_rows, (cols) := foo_vec(.SD, start = -1, end = 5,
                                             step = 1, repl = TRUE),
              .SDcols = cols]

dt_join_201501 <- merge(dt_201501,
                        dt_str_201501[, .(id, rwa, el, ec, pd, lgd)],
                        by.x = c("id"), by.y = c("id"),
                        suffixes = c("_base", "_stress"))

# Fix: the functional form of := must be backquoted -- a bare `:= (...)`
# in j is a parse error in R.
dt_join_201501[, `:=` (rwa_delta = rwa_stress - rwa_base,
                       ec_delta = ec_stress - ec_base,
                       el_delta = el_stress - el_base,
                       pd_delta = pd_stress - pd_base,
                       lgd_delta = lgd_stress - lgd_base)]

remove unnecessary cols

dt_join_201501[, c("rwa_stress", "ec_stress", "el_stress", "pd_stress", "lgd_stress") := NULL]

compute relative delta

dt_join_201501[, := (rwa_delta_rel = rwa_delta / rwa_base, ec_delta_rel = ec_delta / ec_base, el_delta_rel = el_delta / el_base)]


```{r}
dt1 <- data.table(a=rep(10, 30))
dt2 <- data.table(b=rep(10, 30))

dt1[, c := foo_vec(.SD, start = -2, end = 10, step = 1,
                   repl = T), .SDcols = c("a")]

dt2[, d := foo_vec(.SD, start = -2, end = 10, step = 1,
                   repl = T), .SDcols = c("b")]

temp <- cbind(dt1, dt2)

temp %>%
  ggplot(aes(x = c, 
             y = d)) +
  geom_point(shape=20, col="darkblue") +
  scale_y_continuous(labels=comma) +
  labs(title="RWA Base vs. RWA Stress Delta (in mn)",
       x="Base (in mn)", y="Stress Delta (in mn)")
# Report monetary metrics in millions.
scale <- 1e6

# Base RWA vs. stressed RWA (base + delta), both in millions.
ggplot(dt_join_201501,
       aes(x = rwa_base / scale, y = (rwa_base + rwa_delta) / scale)) +
  geom_point(shape = 20, col = "darkblue") +
  scale_y_continuous(labels = comma) +
  labs(title = "RWA Base vs. RWA Stress (in mn)",
       x = "Base (in mn)", y = "Stress (in mn)")

# Base RWA vs. the delta alone, one panel per country.
ggplot(dt_join_201501,
       aes(x = rwa_base / scale, y = rwa_delta / scale)) +
  geom_point(shape = 20, col = "darkblue") +
  scale_y_continuous(labels = comma) +
  labs(title = "RWA Base vs. RWA Stress Delta (in mn)",
       x = "Base (in mn)", y = "Stress Delta (in mn)") +
  facet_wrap(~ country)

# Adding a smoother to a plot; geom_smooth also supports loess, gam, etc.
ggplot(dt_join_201501,
       aes(x = rwa_base / scale, y = (rwa_base + rwa_delta) / scale)) +
  geom_point(shape = 20, col = "darkblue") +
  scale_y_continuous(labels = comma) +
  geom_smooth(method = "lm", colour = "red")

# Per-country scatter of the RWA deltas (boxplots and jittered points).
ggplot(dt_join_201501, aes(x = country, y = rwa_delta / scale)) +
  geom_point(shape = 20, col = "darkblue") +
  scale_y_continuous(labels = comma)

# Boxplot of per-country RWA deltas; red open circles mark outliers.
dt_join_201501 %>%
  ggplot(aes(country, rwa_delta / scale)) +
  geom_boxplot(fill = "white", colour = "darkblue",
               outlier.colour = "red", outlier.shape = 1) +
  scale_y_continuous(labels = comma) +
  labs(title = "RWA Delta (in mn)", x = "Country", y = "RWA Delta (in mn) \n")

# Histograms and frequency polygons of base PD.
dt_join_201501 %>%
  ggplot(aes(pd_base)) +
  geom_histogram(binwidth = 0.01, fill = "darkblue")

dt_join_201501 %>%
  ggplot(aes(pd_base)) +
  geom_freqpoly(binwidth = 0.01, colour = "darkblue")

# Stressed PD, one frequency polygon per country (default binwidth).
dt_join_201501 %>%
  ggplot(aes(pd_base + pd_delta, colour = country)) +
  geom_freqpoly()

dt_join_201501 %>%
  ggplot(aes(pd_base + pd_delta, fill = country)) +
  geom_histogram(binwidth = 0.01) +
  facet_wrap(~ country, ncol = 2)

# Bar chart: observation counts per country.
dt_join_201501 %>%
  ggplot(aes(country)) +
  geom_bar(fill = "darkblue")

utilities

# Session context: current date, date-time, time zone, working directory.
Sys.Date()
Sys.time()
Sys.timezone()
getwd()

# create dummy file
cat("file A\n", file = "./a.txt")

# list all files starting with a-b or r
list.files(path = ".", pattern = "^[a-br]", all.files = FALSE,
           full.names = FALSE, recursive = FALSE,
           ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

# list directories, can include subdirectories via recursive = T
list.dirs(path = ".", full.names = T, recursive = F)

# file existence and edit rights
file.exists("./a.txt")
file.exists("./nosuchfile")   # FALSE for a missing path

# file.access returns 0 when access is granted, -1 otherwise
# test for existence.
file.access("./a.txt", mode = 0)
# test for execute permission.
file.access("./a.txt", mode = 1)
# test for write permission.
file.access("./a.txt", mode = 2)
# test for read permission.
file.access("./a.txt", mode = 4)

# file info (size, mode, timestamps, ...)
file.info("./a.txt")

# copy
file.copy("a.txt", "b.txt", overwrite = T)

# append the contents of b.txt to a.txt, then rename a.txt
file.append("./a.txt", "./b.txt")
file.rename("a.txt", "a_new.txt")

# remove; removing a non-existent file warns and returns FALSE
file.remove(c("./a_new.txt", "./b.txt"))
file.remove("./nosuchfile")

# create and remove directory (unlink needs recursive = T for directories)
dir.create("./folder1")
unlink(c("./folder1"), recursive = T)

benchmarking

# Timing workload: draw x standard-normal deviates.
fn_test <- function(x) {
  rnorm(n = x, mean = 0, sd = 1)
}

# One-shot wall-clock timing, then 10 repetitions via microbenchmark.
system.time(fn_test(1*10^6))
library(microbenchmark)
res <- microbenchmark(fn_test(1*10^6), times = 10L)
boxplot(res)
# autoplot only if ggplot2 is available (require returns FALSE otherwise)
if (require("ggplot2")) {
  autoplot(res)
}

aggregation functions

library(data.table)
library(dplyr)
options(datatable.verbose = FALSE)

# Simulated input: 1e6 numeric draws and letter labels. NA padded at opposite
# ends so every column has exactly one NA (columns are n_size + 1 long).
n_size <- 1*10^6
# FIX: replace = TRUE spelled out; the original rep = T relied on partial
# argument matching and the reassignable T alias.
sample_metrics <- sample(seq(from = 1, to = 100, by = 1), n_size, replace = TRUE)
sample_dimensions <- sample(letters[10:12], n_size, replace = TRUE)
df <- 
  data.frame(
    a = c(NA, sample_metrics),
    b = c(sample_metrics, NA),
    c = c(NA, sample_dimensions),
    d = c(sample_dimensions, NA),
    x = c(NA, sample_metrics),
    y = c(sample_dimensions, NA),
    stringsAsFactors = FALSE)

# data.table copy of the same data for the data.table-based variants
dt <- as.data.table(df)
# Sum metric columns of a data.table grouped by dimension columns.
#
# dt             data.table to aggregate (not modified; returns a new table)
# metric         character vector of metric column names (passed to .SDcols)
# metric_name    names for the aggregated columns in the result
# dimension      character vector of grouping columns (keyby -> keyed, sorted)
# dimension_name accepted for interface symmetry but currently unused
fn_dt_agg1 <- 
  function(dt, metric, metric_name, dimension, dimension_name) {

  # setNames + lapply allows passing variable column names through .SDcols.
  # FIX: na.rm = TRUE spelled out (T is a reassignable alias, not a keyword).
  temp <- dt[, setNames(lapply(.SD, function(x) {sum(x, na.rm = TRUE)}), 
                        metric_name), 
             keyby = dimension, .SDcols = metric]

  #setorderv(temp, dimension) in case order is different than dimension

  # trailing [] forces printing when the function is called at top level
  temp[]
  }

# Baseline: sum a and b grouped by c and d (output names unchanged).
res_dt1 <- 
  fn_dt_agg1(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"))
# Aggregate by building the data.table j-expression programmatically:
# .(name1 = agg(var1, na.rm = TRUE), ...), then evaluating it per group.
#
# agg_type       name of the aggregation function, e.g. "sum" or "mean"
# dimension_name accepted for interface symmetry but currently unused
# (other arguments as in fn_dt_agg1)
fn_dt_agg2 <- 
  function(dt, metric, metric_name, dimension, dimension_name,
           agg_type) {

  # FIX: `<-` instead of `=` for assignment; lapply replaces
  # sapply(..., simplify = FALSE), which is the same operation.
  j_call <- as.call(c(
    as.name("."),
    lapply(setNames(metric, metric_name), 
           function(var) as.call(list(as.name(agg_type), 
                                      as.name(var), na.rm = TRUE)))
    ))

  # keyby sorts and keys the result; trailing [] forces printing
  dt[, eval(j_call), keyby = dimension][]
  }

# Same aggregation as res_dt1, via the constructed-call variant.
res_dt2 <- 
  fn_dt_agg2(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"),
    agg_type = c("sum"))

# both approaches must agree with the hard-coded sum baseline
all.equal(res_dt1, res_dt2)
# Variant building the aggregation function from a text template.
# WARNING: eval(parse(text = ...)) executes arbitrary code supplied through
# agg_type -- acceptable for this interactive comparison, but prefer
# match.fun()/getFunction() (see fn_dt_agg4) in production code.
fn_dt_agg3 <- 
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

  # FIX: generated code now spells out na.rm = TRUE (was T, reassignable)
  e <- eval(parse(text=paste0("function(x) {", 
                              agg_type, "(", "x, na.rm = TRUE)}"))) 

  temp <- dt[, setNames(lapply(.SD, e), 
                        metric_name), 
             keyby = dimension, .SDcols = metric]

  # trailing [] forces printing when called at top level
  temp[]
  }

# Same aggregation via the parsed-text variant; must match the baseline.
res_dt3 <- 
  fn_dt_agg3(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"), 
    agg_type = "sum")

all.equal(res_dt1, res_dt3)
# Variant resolving agg_type by name via getFunction() (methods package)
# instead of eval(parse()). base::match.fun() would work equally well and
# needs no extra package.
fn_dt_agg4 <- 
  function(dt, metric, metric_name, dimension, dimension_name, agg_type) {

    # FIX: na.rm = TRUE spelled out (was T, which is reassignable)
    e <- function(x) getFunction(agg_type)(x, na.rm = TRUE)

    temp <- dt[, setNames(lapply(.SD, e), 
                          metric_name), 
               keyby = dimension, .SDcols = metric]
    # trailing [] forces printing when called at top level
    temp[]
  }

# Same aggregation via the getFunction variant; must match the baseline.
res_dt4 <- 
  fn_dt_agg4(
    dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"), 
    agg_type = "sum")

all.equal(res_dt1, res_dt4)
# dplyr counterpart of the data.table aggregators.
#
# FIX: the original used the underscore SE verbs (select_, group_by_,
# summarise_each_, rename_), which are defunct in current dplyr, plus
# eval(parse()) to build the aggregation function. Rewritten with the
# supported across()/all_of() idioms and match.fun(); same arguments,
# same result.
fn_df_agg1 <-
  function(df, metric, metric_name, dimension, dimension_name, agg_type) {

    # resolve the aggregation function once by name, e.g. "sum" -> sum
    agg_fun <- match.fun(agg_type)

    # .groups = "drop_last" mirrors the classic summarise behavior of
    # peeling off the innermost grouping level
    df %>%
      select(all_of(c(dimension, metric))) %>%
      group_by(across(all_of(dimension))) %>%
      summarise(across(all_of(metric),
                       function(x) agg_fun(x, na.rm = TRUE)),
                .groups = "drop_last") %>%
      rename(all_of(setNames(c(dimension, metric),
                             c(dimension_name, metric_name))))
  }

res_df1 <- 
  fn_df_agg1(
    df = df, metric = c("a", "b"), metric_name = c("a", "b"),
    dimension = c("c", "d"), dimension_name = c("c", "d"),
    agg_type = "sum")

# dplyr result should match the data.table baseline after conversion
all.equal(res_dt1, as.data.table(res_df1))
# Pick the aggregation for the benchmark below by running ONE of these lines;
# executed top to bottom, only the last assignment ("sum") takes effect.
test_agg_type <- c("min")
test_agg_type <- c("max")
test_agg_type <- c("median")
test_agg_type <- c("mean")
test_agg_type <- c("sum")

#library(microbenchmark)
# Benchmark all five implementations on identical inputs, 100 runs each,
# using the aggregation selected via test_agg_type (fn_dt_agg1 is sum-only).
bench_res <- 
  microbenchmark(
    fn_dt_agg1 = 
      fn_dt_agg1(
        dt = dt, metric = c("a", "b"), metric_name = c("a", "b"), 
        dimension = c("c", "d"), dimension_name = c("c", "d")), 
    fn_dt_agg2 = 
      fn_dt_agg2(
        dt = dt, metric = c("a", "b"), metric_name = c("a", "b"), 
        dimension = c("c", "d"), dimension_name = c("c", "d"),
        agg_type = test_agg_type),
    fn_dt_agg3 =
      fn_dt_agg3(
        dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
        dimension = c("c", "d"), dimension_name = c("c", "d"),
        agg_type = test_agg_type),
    fn_dt_agg4 = 
      fn_dt_agg4(
        dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
        dimension = c("c", "d"), dimension_name = c("c", "d"), 
        agg_type = test_agg_type),
    fn_df_agg1 =
      fn_df_agg1(
        df = df, metric = c("a", "b"), metric_name = c("a", "b"),
        dimension = c("c", "d"), dimension_name = c("c", "d"),
        agg_type = test_agg_type),
    times = 100L)

# summary table plus two visualisations of the timing distributions
bench_res
boxplot(bench_res)
if (require("ggplot2")) {
  autoplot(bench_res)
  }
# Consolidated aggregation helper (final version of the variants above).
#
# dt          data.table to aggregate (returned as a new keyed table)
# metric      character vector of metric column names (passed to .SDcols)
# metric_name result names for the metrics; defaults to metric
# dimension   character vector of grouping columns; the c() (= NULL) default
#             aggregates the whole table to a single row
# agg_type    name of the aggregation function, e.g. "sum", "mean"
# na.rm       forwarded to the aggregation function
fn_dt_agg <- 
  function(dt, metric, metric_name = metric, dimension = c(), 
           agg_type = c("sum"), na.rm = TRUE) {

    # FIX: base::match.fun replaces methods::getFunction -- identical lookup
    # for ordinary functions, no dependency on the methods package.
    e <- function(x) match.fun(agg_type)(x, na.rm = na.rm)

    # trailing [] forces printing when called at top level
    dt[, setNames(lapply(.SD, e), metric_name), 
       keyby = dimension, .SDcols = metric][]
  }

test_agg_type <- c("sum")
# FIX: fn_dt_agg has no dimension_name formal, so passing it raised an
# "unused argument" error; the argument is dropped here.
fn_dt_agg(
        dt = dt, metric = c("a", "b"), metric_name = c("a", "b"),
        dimension = c("c", "d"),
        agg_type = test_agg_type)

# call with default parameters
fn_dt_agg(dt = dt, metric = c("a", "b"))

# NOTE(review): the call below cannot work as written -- `dimension` is
# evaluated in the calling frame, where column `a` is not visible, and
# fn_dt_agg expects a character vector of column names. Kept for reference.
#fn_dt_agg(dt = dt, metric = c("a", "b"), dimension = c(test = a > 1))

testing

http://stackoverflow.com/questions/29282994/how-to-write-a-testthat-unit-test-for-a-function-that-returns-a-data-frame

resources

general

dplyr

data table