---
title: "data wrangling on star-like data v 1.0"
author: "me"
classoption: landscape
toc: true
toc_depth: 2
header-includes:
  - \usepackage{longtable}
output:
  pdf_document: default
  html_notebook: default
urlcolor: blue
---
\pagebreak
release notes
initial version
required libraries
global load; some libraries are also loaded again in their respective section to indicate which package a given operation needs, or to ensure the correct function is used in case of a name clash
# output options
options("scipen"=100)
#opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
options(xtable.floating = FALSE)
options(xtable.timestamp = "")
options(xtable.comment = FALSE)
# tex / pandoc options for pdf creation
x <- Sys.getenv("PATH")
y <- paste(x, "E:\\Datenordner\\Downloads\\miktex\\miktex\\bin", sep=";")
Sys.setenv(PATH = y)
# always stringsAsFactors = F; if factors needed, declare them explicitly
options(stringsAsFactors = F)
# set work directory for read / write operations
path <- "E:/Datenordner/raw_data"
# if data is some place else, set root.dir option
opts_knit$set(root.dir = path)
#setwd(paste(path, sep="/"))
\pagebreak
overview
create simulated fdw- / ste-like data for reproducibility
data wrangling operations with dplyr/tidyr (hadleyverse) vs data.table package
data will be set up in a star schema, i.e. with dimensions (attributes) and facts (metrics). each entity has an identifier for lookups. this is an oversimplified version of the data warehouse
the fact table holds all metrics and references all dimension tables via the respective key. for simplicity each fact holds all keys
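as a minimal illustration of this layout (toy tables with made-up identifiers, not the simulated data used below), a star schema is one fact table plus keyed dimension tables:

```r
library(data.table)

# toy dimension table (attribute lookup)
dim_country <- data.table(country_id = 1:3,
                          country    = c("DE", "FR", "US"),
                          region     = c("EU", "EU", "NA"))

# toy fact table: metrics plus one key per dimension
fact <- data.table(id         = 1:4,
                   country_id = c(1, 1, 2, 3),
                   rwa        = c(100, 250, 80, 300))

# lookup of dimension attributes via the shared key
merge(fact, dim_country, by = "country_id")
```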
knitr::include_graphics("./fdw_star.png")
country table e.g. country of domicile
country table holds information on country and region
looking at a summary of the data. while a data.frame would print everything up to options("max.print"), a data.table only prints the first and last 5 rows once a certain row count is exceeded. use the str function to look at a summary only.
str(data_201501)
str(dt_201501)
dt_201501
simulate slightly different base data for another month-end. start from the first month-end so that the meta data stays constant (e.g. a country does not change over months). allow a +/- 10% change between months by sampling a multiplier.
# take deep copy rather than reference
data_201502 <- copy(data_201501)
dt_201502 <- copy(dt_201501)
# define columns to change over time
cols <- c("rwa", "el", "ec", "pd", "lgd")
create a sampling function that lets one choose the range of the multiplier and the step size
foo <- function(x, start, end, step, repl) {
x * sample(seq(from = start, to = end, by = step), size = 1, replace = repl)
}
# vectorized version: draw one multiplier per element; size = NROW(x)
# keeps the multiplier count aligned with the input (and also covers
# the .SD case, where x is a table and NROW gives the row count)
foo_vec <- function(x, start, end, step, repl) {
x * sample(seq(from = start, to = end, by = step), size = NROW(x), replace = repl)
}
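a quick sanity check of the two samplers (self-contained, so both helpers are restated here; foo_vec uses size = NROW(x) so one multiplier is drawn per element):

```r
foo <- function(x, start, end, step, repl) {
  # one common multiplier for the whole vector
  x * sample(seq(from = start, to = end, by = step), size = 1, replace = repl)
}
foo_vec <- function(x, start, end, step, repl) {
  # an independent multiplier per element
  x * sample(seq(from = start, to = end, by = step), size = NROW(x), replace = repl)
}

set.seed(7)
x <- c(100, 200, 300)
foo(x, 0.9, 1.1, 0.01, TRUE)      # all elements scaled by the same factor
foo_vec(x, 0.9, 1.1, 0.01, TRUE)  # each element scaled independently
```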
dplyr
dplyr uses non-standard evaluation (NSE)
many functions have standard evaluation equivalents, generally denoted by an underscore "_" after the function name
use pipe operator %>% for chaining (more examples later)
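a small side-by-side of the two styles, assuming a dplyr version of this era where the underscored SE verbs are still available (they were later deprecated in favour of tidy evaluation):

```r
library(dplyr)
df <- data.frame(x = 1:4, y = letters[1:4])

# NSE: column referenced by bare name
filter(df, x > 2)

# SE counterpart: condition passed as a string,
# useful when the condition is built programmatically
filter_(df, "x > 2")
```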
# using standard evaluation version of mutate_each function to operate on vector
# of strings representing columns
data_201502 <-
data_201502 %>%
mutate_each_(funs(foo(., start = 0.9, end = 1.1, step = 0.01, repl = T)), cols)
# adjust month-end
data_201502 <-
data_201502 %>%
mutate(month = 2,
year = 2015)
data.table
data.table performs operations on the actual object, i.e. no copy is made, so the assignment operator is not required
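the by-reference semantics are easiest to see in isolation on a toy table:

```r
library(data.table)
dt_a <- data.table(x = 1:3)

# plain assignment copies the *reference*, not the data
dt_b <- dt_a
dt_b[, y := x * 2]   # this modifies dt_a as well
names(dt_a)          # now contains both x and y

# copy() takes a deep copy, leaving the original untouched
dt_c <- copy(dt_a)
dt_c[, z := 0]
"z" %in% names(dt_a) # FALSE
```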
stress will generally introduce a higher value for all metrics but may at times also lead to decreasing exposure metrics, e.g. due to hedges. the sampling function should therefore reflect this. dplyr and data.table are done in one go as the logic has been made clear before.
mutate allows to refer to columns that were just created
"." in dplyr is placeholder for data
data_201501 %>%
mutate(new_col = rwa - el,
ec = ec * 2,
new_col_2 = new_col / 2) %>%
head(., 2)
# if you only want to keep the new variables, use transmute()
head(transmute(data_201501,
new_col = rwa - el,
new_col_2 = new_col / 2), 2)
data.table inherits some functions from data.frame class so syntax is the same
.N contains number of rows
select columns
different ways to select
dt_201501[1]
dt_201501[1, ]
dt_201501[1:2, ]
dt_201501$id[1:5]
# extracting one column by position does not work here (older data.table
# versions return the number itself; newer versions return the column)
dt_201501[, 1]
# instead give column name
head(dt_201501[, country], 2)
# above subset returns a vector; Try DT[,.(country)] instead. .() is an
# alias for list() and ensures a data.table is returned.
dt_201501[1:2, .(country)]
dt_201501[1:2, list(country)]
mycol <- "country"
# returns a vector
head(dt_201501[[mycol]], 2)
# with = F returns table
dt_201501[1:2, mycol, with = FALSE]
# grab multiple columns
dt_201501[1:2, list(country, cpy_type)]
# penultimate row of DT using `.N`
dt_201501[.N-1]
# dimensions and names
colnames(dt_201501)
dim(dt_201501)
# select row 2 twice and row 3 for selected columns
dt_201501[c(2,2,3), .(id, country)]
the table needs to have the columns that should be checked for duplicates set as keys
a table can have multiple keys, and checking duplicates on only one of them is still possible
if no keys are set, the entire row is treated as the key
note that duplicated flags a row only if an identical row occurs at a smaller subscript; the first occurrence is therefore never marked as a duplicate, even if it is duplicated later
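the "smaller subscripts" rule is easiest to see on a plain vector:

```r
# only the later occurrence is flagged; the first "a" is not marked
# because no earlier element equals it
duplicated(c("a", "a", "b"))   # FALSE TRUE FALSE
```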
dt_with_key <- copy(dt_201501)
# use setkeyv to set more than one key
setkeyv(dt_with_key, c("country", "inst_id"))
key(dt_with_key)
head(dt_with_key, 3)
duplicated(dt_with_key, by=c("country", "inst_id"))[1:5]
# since key is already set, also works without explicitly setting it
duplicated(dt_with_key)[1:5]
# will not work on ad-hoc keys -> result is wrong
head(duplicated(dt_201501, by=c("country", "inst_id")), 2)
# for table with no key entire row will be evaluated as key
duplicated(dt_201501)[1:5]
unique returns a data.table with duplicated rows (by key) removed, or (when no key) duplicated rows by all columns removed
anyDuplicated returns the index i of the first duplicated entry if there is one, and 0 otherwise.
uniqueN is equivalent to length(unique(x)) but much faster for atomic vectors
dt_with_key_unique <- unique(dt_with_key)
dt_with_key_unique[1:5, .(country, inst_id)]
# return index of first duplicate if there is one, otherwise 0
# note by=key(NULL) sets any pre-set key to NULL
anyDuplicated(dt_with_key, by=key(NULL))
anyDuplicated(dt_with_key)
uniqueN(dt_with_key)
adding / removing / adjusting columns
# set country code with less than 2 characters to unknown
dt_201501[nchar(country) < 2, country := "Unknown"]
# set na values to zero
dt_201501[is.na(rwa), rwa := 0]
# add new columns via custom function by group
vars <- c("rwa", "ec", "el")
dt_temp <- copy(dt_201501)
dt_temp[, paste0(vars,"_","sum") := lapply(.SD, sum),
.SDcols = vars, by = country]
dt_temp[1:2]
# alternatively with pre-determined functions
funs <- c("min", "max", "mean", "sum") # <- define your function
for (i in funs) {
  # get(i) looks up the function by name (clearer than eval on a string)
  dt_temp[, paste0(vars, "_", i) := lapply(.SD, get(i)), .SDcols = vars,
          by = cpy_type]
}
# remove column
dt_temp[, el_sum := NULL]
# left join -> see vignette for more
data_merged_table <-
left_join(x = data_base_table,
y = data_201501[, c("id", "rwa", "el", "ec", "pd", "lgd")],
by = c("id" = "id"))
data.table
using pre-set keys vs ad-hoc keys
## left join -> see vignette for more
setkey(dt_base_table, id)
setkey(dt_201501, id)
dt_merged_table <-
dt_base_table[dt_201501,
list(id, month, year, country, ubrtrn, inst_id, product_type,
corep, cpy_type, rwa, el, ec, pd, lgd)]
# check how to avoid having to state all column names of base table,
# maybe using setdiff
## not using pre-key setting but rather ad-hoc key setting
# clear keys
setkey(dt_base_table, NULL)
setkey(dt_201501, NULL)
# generic syntax
dt_merged_table <-
dt_base_table[dt_201501,
list(id, month, year, country, ubrtrn, inst_id, product_type,
corep, cpy_type, rwa, el, ec, pd, lgd),
on = c(id = "id")]
# using data.table::merge function
dt_merged_table <-
merge(dt_base_table, dt_201501[, list(id, rwa, el, ec, pd, lgd)],
by.x = "id", by.y = "id")
dt_sorted <- dt_unsorted[order(-year, -month, country, inst_id)]
# use setorder to order by reference (in-place), without making any
# additional copies; note that dt_sorted is then just another name
# for the same (now reordered) object
dt_sorted <- setorder(dt_unsorted, -year, -month, country, inst_id)
reshape
general
reshape generally refers to reshaping tables from long to wide and vice versa
dplyr does not provide reshape functions but reshape2 and tidyr (successor of reshape2) provide respective functions (both created by hadley wickham)
data.table and tidyr provide similar functions but use slightly different vocabulary
only the two most common reshape functions are explored, but there are more
be aware of type conversion of melted/casted attribute
# long to wide
temp <- dcast(data = data[,.(id, year, month, rwa)],
formula = id + year ~ month,
value.var = "rwa")
# rename cols
names(temp) <- c("id", "year", "rwa_1", "rwa_2", "rwa_3")
str(temp)
# wide to long
temp <- melt(data = temp,
id.vars = c("id", "year"),
measure.vars = c("rwa_1", "rwa_2", "rwa_3"),
variable.name = "month",
value.name = "rwa")
str(temp)
import / export
general
dplyr does not provide distinct import/export function
readr is a separate package for reading files from the hadleyverse
dplyr itself provides no distinct export function; readr does, and data.table gained fwrite around v1.9.7 -> cross-check
also follow the discussion around more efficient data formats, e.g. the cross-platform (currently python & R) format "feather", details at \textcolor{blue}{cran feather}
export with base r
# Write to a file, suppress row names
write.csv(data_201501, "./data.csv", row.names=FALSE)
# Same, except that instead of "NA", output a dot as in SAS
write.csv(data_201501, "./data.csv", row.names=FALSE, na=".")
# Use tabs, suppress row names
write.table(data_201501, "./data.csv", sep="\t", row.names=FALSE)
# use readr::write_delim, which is about twice as fast as write.csv
# and never writes row names; unlike write.table it does not
# quote text fields
write_delim(x=data_201501, path="./data.csv", delim="\t", append=FALSE)
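for the data.table side, a minimal round trip with fwrite (introduced around v1.9.7/1.9.8) and fread on a toy table:

```r
library(data.table)
dt <- data.table(id = 1:3, value = c(1.5, 2.5, 3.5))

# fwrite is a fast, multi-threaded csv writer
fwrite(dt, "./data_dt.csv")

# fread auto-detects separator and column types
dt_in <- fread("./data_dt.csv")
```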
# using direct syntax (interpreted as left join, i.e. X[Y] is a join, looking up X's
# rows using Y (or Y's key if it has one) as an index.)
# using direct syntax will modify original object by reference. if a copy is needed,
# copy base table first and perform join afterwards (memory-inefficient)
# for unkeyed tables use parameter on = "id" as ad-hoc key
dt_join_201501 <- copy(dt_201501)
# (with = FALSE is not needed when the LHS of := is a character vector)
dt_join_201501[dt_str_201501, c("rwa_delta", "ec_delta", "el_delta", "pd_delta",
                                "lgd_delta") :=
                 .(i.rwa - rwa, i.ec - ec, i.el - el, i.pd - pd, i.lgd - lgd),
               on = "id"]
# alternative syntax which some may find easier to debug
dt_join_201501 <- copy(dt_201501)
dt_join_201501[dt_str_201501, `:=` (rwa_delta = i.rwa - rwa,
ec_delta = i.ec - ec,
el_delta = i.el - el,
pd_delta = i.pd - pd,
lgd_delta = i.lgd - lgd),
on = "id"]
dt_201501 <- data.table(create_obs(n_obs=10000, month=1, year=2015))
setkey(dt_201501, id)
# using merge function to create new table, use all=T if full join is wanted.
# merge itself cannot compute the deltas in the same step; the X[Y, `:=`]
# form above can do join and computation in one go
dt_join_201501 <-
merge(dt_201501, dt_str_201501[, .(id, rwa, el, ec, pd, lgd)],
suffixes=c("_base", "_stress"))
dt_join_201501[, `:=` (rwa_delta = rwa_stress - rwa_base,
ec_delta = ec_stress - ec_base,
el_delta = el_stress - el_base,
pd_delta = pd_stress - pd_base,
lgd_delta = lgd_stress - lgd_base)]
# remove unnecessary cols
dt_join_201501[, c("rwa_stress", "ec_stress",
"el_stress", "pd_stress", "lgd_stress") := NULL]
# compute relative delta
dt_join_201501[, `:=` (rwa_delta_rel = rwa_delta / rwa_base,
ec_delta_rel = ec_delta / ec_base,
el_delta_rel = el_delta / el_base)]
exploratory analysis
impact by various aggregations for different vars
dt_join_201501[, .(rwa_base = sum(rwa_base, na.rm = T),
rwa_delta = sum(rwa_delta, na.rm = T),
rwa_delta_min = min(rwa_delta, na.rm = T),
rwa_delta_mean = mean(rwa_delta, na.rm = T),
rwa_delta_median = median(rwa_delta, na.rm = T),
rwa_delta_max = max(rwa_delta, na.rm = T),
count = .N), by = c("country")]
scale <- 1*10^6
ggplot(dt_join_201501, aes(country, rwa_delta/scale)) +
geom_boxplot(fill = "white", colour = "darkblue",
outlier.colour = "red", outlier.shape = 1) +
scale_y_continuous(labels=comma) +
labs(title="RWA Delta (in mn)", x="Country", y="RWA Delta (in mn) \n")
# do the same with qplot for inst_id and ec
qplot(data = dt_join_201501, x = inst_id, y = ec_delta/scale, fill = inst_id,
geom = "boxplot", xlab = "\n inst_id", ylab = "EC Delta (in mn) \n",
main="EC Delta (in mn) \n") +
theme(legend.position="bottom") +
# adding the mean with black circle
geom_point(stat = "summary", fun.y = "mean", size = I(3), color = I("black")) +
geom_point(stat = "summary", fun.y = "mean", size = I(2.2), color = I("orange"))
ggplot(dt_join_201501, aes(x = pd_base)) +
geom_histogram(binwidth=.01, colour="darkblue", fill="white")
qplot(data = dt_join_201501, x = el_delta/scale, geom = "histogram", y = ..density..,
binwidth = 0.5, colour = I("white"), fill = I("orange"),
xlab = "\n EL Delta (in mn)",
ylab = "Density", main = "Histogram of EL Delta (in mn) \n")
ggplot(dt_join_201501, aes(pd_base, fill = country)) +
geom_density(alpha = 0.1)
dt_join_201501 %>%
ggplot(aes(x=rwa_base/1000000, y=rwa_delta/1000000)) +
geom_point(shape=20, col="darkblue") +
scale_y_continuous(labels=comma) +
labs(title="RWA Base vs. RWA Stress Delta (in mn)",
x="Base (in mn)", y="Stress Delta (in mn)")
ggplot(dt_join_201501, aes(rwa_delta, fill = country)) +
geom_density(alpha = 0.2)
pivot-like / olap-like queries
aggregation by meta table attributes
aggregations that involve meta table attributes require joins by the respective key
an ad-hoc query should not alter the actual data but only produce the desired output
ad-hoc key joins are preferable as the key can change with every query and one doesn't want to set a new key explicitly every time
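a sketch of such an ad-hoc query on toy tables (column names are illustrative): the meta attribute is joined on the fly and aggregated over, without altering either table:

```r
library(data.table)

# toy fact and meta tables
fact <- data.table(country_id = c(1, 1, 2, 2, 2), rwa = c(10, 20, 5, 5, 5))
meta <- data.table(country_id = 1:2, region = c("EU", "NA"))

# ad-hoc key join and aggregation in one chain
fact[meta, on = "country_id"][, .(rwa = sum(rwa)), by = region]
```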
summary taken from Hadley Wickham in "ggplot2: Elegant Graphics for Data Analysis": "In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Facetting can be used to generate the same plot for different subsets of the dataset."
data
take data and introduce more randomness
introduce random outliers for a few obs
n <- dim(dt_str_201501)[1]
sample_rows <- sample(x = n, size = n/500, rep = F)
cols <- c("rwa", "el", "ec")
dt1 <- data.table(a=rep(10, 30))
dt2 <- data.table(b=rep(10, 30))
dt1[, c := foo_vec(a, start = -2, end = 10, step = 1, repl = T)]
dt2[, d := foo_vec(b, start = -2, end = 10, step = 1, repl = T)]
temp <- cbind(dt1, dt2)
temp %>%
  ggplot(aes(x = c, y = d)) +
  geom_point(shape=20, col="darkblue") +
  scale_y_continuous(labels=comma) +
  labs(title="Toy scatter of the two sampled columns", x="c", y="d")
scale <- 1*10^6
dt_join_201501 %>%
ggplot(aes(x = rwa_base / scale,
y = (rwa_base + rwa_delta) / scale)) +
geom_point(shape=20, col="darkblue") +
scale_y_continuous(labels=comma) +
labs(title="RWA Base vs. RWA Stress (in mn)",
x="Base (in mn)", y="Stress (in mn)")
dt_join_201501 %>%
ggplot(aes(x = rwa_base / scale,
y = (rwa_delta) / scale)) +
geom_point(shape=20, col="darkblue") +
scale_y_continuous(labels=comma) +
labs(title="RWA Base vs. RWA Stress Delta (in mn)",
x="Base (in mn)", y="Stress Delta (in mn)") +
facet_wrap(~country)
# Adding a Smoother to a Plot
# alternative smoothing algorithms available
dt_join_201501 %>%
ggplot(aes(x = rwa_base / scale,
y = (rwa_base + rwa_delta) / scale)) +
geom_point(shape=20, col="darkblue") +
scale_y_continuous(labels=comma) +
geom_smooth(method = "lm", colour = "red")
# Boxplots and Jittered Points
dt_join_201501 %>%
ggplot(aes(x = country,
y = (rwa_delta / scale))) +
geom_point(shape=20, col="darkblue") +
scale_y_continuous(labels=comma)
ggplot(dt_join_201501, aes(country, rwa_delta/scale)) +
geom_boxplot(fill = "white", colour = "darkblue",
outlier.colour = "red", outlier.shape = 1) +
scale_y_continuous(labels = comma) +
labs(title="RWA Delta (in mn)", x="Country", y="RWA Delta (in mn) \n")
# Histograms and Frequency Polygons
ggplot(dt_join_201501, aes(pd_base)) +
geom_histogram(binwidth = 0.01, fill = "darkblue")
ggplot(dt_join_201501, aes(pd_base)) +
geom_freqpoly(binwidth = 0.01, colour = "darkblue")
ggplot(dt_join_201501, aes(pd_base + pd_delta, colour = country)) +
geom_freqpoly()
ggplot(dt_join_201501, aes((pd_base + pd_delta), fill = country)) +
geom_histogram(binwidth = 0.01) +
facet_wrap(~country, ncol = 2)
# Bar Charts
ggplot(dt_join_201501, aes(country)) +
geom_bar(fill = "darkblue")
utilities
Sys.Date()
Sys.time()
Sys.timezone()
getwd()
# create dummy file
cat("file A\n", file = "./a.txt")
# list all files starting with a-b or r
list.files(path = ".", pattern = "^[a-br]", all.files = FALSE,
full.names = FALSE, recursive = FALSE,
ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)
# list directories, can include subdirectories via recursive = T
list.dirs(path = ".", full.names = T, recursive = F)
# file existence and edit rights
file.exists("./a.txt")
file.exists("./nosuchfile")
# test for existence.
file.access("./a.txt", mode = 0)
# test for execute permission.
file.access("./a.txt", mode = 1)
# test for write permission.
file.access("./a.txt", mode = 2)
# test for read permission.
file.access("./a.txt", mode = 4)
# file info
file.info("./a.txt")
# copy
file.copy("a.txt", "b.txt", overwrite = T)
# append
file.append("./a.txt", "./b.txt")
file.rename("a.txt", "a_new.txt")
# remove
file.remove(c("./a_new.txt", "./b.txt"))
file.remove("./nosuchfile")
# create and remove directory
dir.create("./folder1")
unlink(c("./folder1"), recursive = T)
benchmarking
fn_test <- function(x){return(rnorm(n = x, mean = 0, sd = 1))}
system.time(fn_test(1*10^6))
library(microbenchmark)
res <- microbenchmark(fn_test(1*10^6), times = 10L)
boxplot(res)
if (require("ggplot2")) {
autoplot(res)
}
aggregation functions
agg function in two versions
regular aggregation of selected metrics by selected dimensions
delta aggregation as extension of regular aggregation with delta computation for selected variables (requires data in long format with variable indicating base and final)
regular aggregation parameters
data as data.table object
exposure metrics to be aggregated (required, could be optional)
non-exposure metrics to be aggregated (weighting variable) (required when weighting)
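a minimal sketch of the regular variant under the parameters above (function name and signature are illustrative, the weighting case is omitted):

```r
library(data.table)

# regular aggregation: sum the selected metrics by the selected dimensions
agg <- function(dt, metrics, dims) {
  dt[, lapply(.SD, sum, na.rm = TRUE), .SDcols = metrics, by = dims]
}

# usage on a toy table
d <- data.table(country = c("DE", "DE", "FR"),
                rwa = c(1, 2, 3), el = c(10, 20, 30))
agg(d, metrics = c("rwa", "el"), dims = "country")
```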
# introduce the outliers on the sampled rows
# (sample_rows and cols as defined above)
dt_str_201501[sample_rows, (cols) := foo_vec(.SD, start = -1, end = 5,
                                             step = 1, repl = T), .SDcols = cols]
dt_str_201502[sample_rows, (cols) := foo_vec(.SD, start = -1, end = 5,
                                             step = 1, repl = T), .SDcols = cols]
extras
ranking / percentile functions
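this subsection has no worked examples yet; as a starting point, data.table's frank is a fast analogue of base rank (dplyr offers ntile, percent_rank and friends) — a small sketch:

```r
library(data.table)
x <- c(10, 30, 20, 20)

# frank: fast rank with several tie-handling strategies
frank(x)                          # average ranks for ties
frank(x, ties.method = "dense")   # dense ranks

# empirical percentile of each value
frank(x) / length(x)
```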
http://stackoverflow.com/questions/39122916/r-custom-data-table-function-with-multiple-variable-inputs
http://stackoverflow.com/questions/37706385/r-data-table-function-wrapper-around-ad-hoc-join-with-aggregation-in-a-chain
http://stackoverflow.com/questions/10675182/in-r-data-table-how-do-i-pass-variable-parameters-to-an-expression
http://stackoverflow.com/questions/31989067/fastest-method-to-replace-data-values-conditionally-in-data-table-speed-compari
http://stackoverflow.com/questions/13756178/writings-functions-procedures-for-data-table-objects?rq=1
http://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another?rq=1
https://gitlab.com/jangorecki/data.cube/blob/master/R/datatable.R or https://github.com/jangorecki/data.cube/blob/master/R/datatable.R
https://github.com/jangorecki/shinyDTQ/blob/master/global.R
http://stackoverflow.com/questions/30468455/dynamically-build-call-for-lookup-multiple-columns
http://adv-r.had.co.nz/Expressions.html#metaprogramming
http://stackoverflow.com/questions/26883859/using-eval-in-data-table?rq=1
http://stackoverflow.com/questions/14837902/how-to-write-a-function-that-calls-a-function-that-calls-data-table
http://stackoverflow.com/questions/37007282/r-data-table-join-sql-select-alike-syntax-in-joined-tables
https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Computing-on-the-language
pryr, purrr
http://stackoverflow.com/questions/28973056/in-r-pass-column-name-as-argument-and-use-it-in-function-with-dplyrmutate-a
http://stackoverflow.com/questions/37404931/fast-data-table-assign-of-multiple-columns-by-group-from-lookup
http://stackoverflow.com/questions/36647468/creating-a-function-with-an-argument-passed-to-dplyrfilter-what-is-the-best-wa
http://stackoverflow.com/questions/15790743/data-table-meta-programming
http://stackoverflow.com/questions/29401907/use-list-of-functions-with-dplyrsummarize-each
http://stackoverflow.com/questions/24833247/how-can-one-work-fully-generically-in-data-table-in-r-with-column-names-in-varia?noredirect=1&lq=1
http://stackoverflow.com/questions/21526674/r-how-to-use-as-call-with-vectors-as-optional-parameters
http://stackoverflow.com/questions/18064602/why-do-i-need-to-wrap-get-in-a-dummy-function-within-a-j-lapply-call?noredirect=1&lq=1
https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-faq.html#ok-but-i-dont-know-the-expressions-in-advance.-how-do-i-programatically-pass-them-in
testing
http://stackoverflow.com/questions/29282994/how-to-write-a-testthat-unit-test-for-a-function-that-returns-a-data-frame
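in the spirit of the linked question, a minimal testthat sketch for a function that returns a data.table (function name and columns are hypothetical):

```r
library(data.table)
library(testthat)

# function under test: aggregates a value column by group
make_summary <- function(dt) dt[, .(total = sum(value)), by = group]

test_that("make_summary aggregates by group", {
  input <- data.table(group = c("a", "a", "b"), value = c(1, 2, 3))
  result <- make_summary(input)
  expect_equal(nrow(result), 2)
  expect_equal(result[group == "a", total], 3)
})
```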
resources
general
dplyr
data table