cmu-delphi / epiforecast-R

R package to implement and visualize several epidemiological forecasting methods.
GNU General Public License v2.0

function cv_apply usage #10

Open yijunwang0805 opened 3 years ago

yijunwang0805 commented 3 years ago

Hi,

It is me again!

Could you please give an example of how to use the cv_apply function?

Thank you!

brookslogan commented 3 years ago

Here are some illustrations of using cv_apply: first as a fancy apply function, second to do leave-one-out CV, and third to do time series CV. You can imagine the function passed to cv_apply either fitting a model to the training data, or fitting a model on the training data and evaluating it on the test data; here I only show how to extract parts of the training and test sets.


library(magrittr)

data.array = tidyr::crossing(a=1:3,b=4:6,c=7:9) %>%
  dplyr::mutate(abc = paste0(a,b,c)) %>%
  reshape2::acast(a ~ b ~ c, value.var="abc")
names(dimnames(data.array)) <- c("a","b","c")

print(data.array)

## A simple example that doesn't look like CV:
cv_apply(data.array, list(each=NULL,each=NULL,all=NULL), function(train, test) {
  print("TRAIN")
  print("dim:")
  print(dim(train)) # 1 1 3
  print("dimnames:")
  print(dimnames(train)) # one name each for `a` and `b`, all three names for `c`
  print("object:")
  print(train)
  print("RESHAPED")
  reshaped.train = train
  dim(reshaped.train) <- dim(train)[3]
  dimnames(reshaped.train) <- dimnames(train)[3]
  print("dim:")
  print(dim(reshaped.train))
  print("dimnames:")
  print(dimnames(reshaped.train))
  print("object:")
  print(reshaped.train)
  print(identical(train, test)) # train and test are sliced identically when using only `each` and `all`
  stop ('STOPPING AFTER THE FIRST "FOLD"')
})

## Leave-one-value-of-`c`-out-CV:
cv_apply(data.array, list(all=NULL,all=NULL,loo=NULL), function(train, test) {
  print("TRAIN")
  print(train) # has c=8 and c=9 data in the first fold
  print("TEST")
  print(test) # has c=7 data in the first fold
  stop ('STOPPING AFTER THE FIRST "FOLD"')
})

## "Time series CV" treating `c` as the time dimension. Start with the second value of `c` so that the training set always contains at least one value of `c` and is never empty:
results = cv_apply(data.array, list(all=NULL,all=NULL,oneahead=2), function(train, test) {
  print("TRAIN")
  print(dim(train)) # varies
  print(dimnames(train))
  print("TEST")
  print(dim(test)) # 3 3 1 for both folds
  print(dimnames(test))
  result = list(trainfirst=train[[1]], testfirst=test[[1]])
  return (result)
})

dim(results) # 2 1 1 2

dimnames(results)
## [[1]]
## [1] "trainfirst" "testfirst"

## $a
## [1] "all"

## $b
## [1] "all"

## $c
## [1] "8" "9"

names(dimnames(results)) # "" "a" "b" "c"

results[["trainfirst","all","all","8"]] # the value for `trainfirst` using all values of `a`, all values of `b`, and training data for values of `c` before "8"

results
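
For anyone who wants to see the fold structure without installing the package, here is a minimal base-R sketch of the train/test slices that the `loo` and `oneahead` examples above describe. It does not call cv_apply and makes no claim about how the package implements it; the names `loo.folds` and `oneahead.folds` are hypothetical, and the fold definitions are assumptions taken from the comments in the examples (test on one value of `c`, train on the rest or on the earlier values).

```r
# Rebuild the 3 x 3 x 3 example array with base R only.
# expand.grid varies `a` fastest, which matches the order in which array() fills.
grid <- expand.grid(a = 1:3, b = 4:6, c = 7:9)
data.array <- array(
  paste0(grid$a, grid$b, grid$c),
  dim = c(3, 3, 3),
  dimnames = list(a = as.character(1:3),
                  b = as.character(4:6),
                  c = as.character(7:9))
)

n.c <- dim(data.array)[3]

# Leave-one-value-of-`c`-out (assumed meaning of `loo`):
# fold i tests on slice i of `c` and trains on the remaining slices.
loo.folds <- lapply(seq_len(n.c), function(i) {
  list(train = data.array[, , -i, drop = FALSE],
       test  = data.array[, , i, drop = FALSE])
})
names(loo.folds) <- dimnames(data.array)$c

# One-ahead folds starting at the second value of `c` (assumed meaning of `oneahead=2`):
# fold i tests on slice i and trains on all earlier slices.
oneahead.folds <- lapply(2:n.c, function(i) {
  list(train = data.array[, , seq_len(i - 1), drop = FALSE],
       test  = data.array[, , i, drop = FALSE])
})
names(oneahead.folds) <- dimnames(data.array)$c[2:n.c]

print(dimnames(loo.folds[["7"]]$train)$c)   # values of `c` left in the first training set
print(dim(oneahead.folds[["9"]]$train))     # training set size for the last fold
```

`drop = FALSE` keeps all three dimensions on every slice, so each fold stays a 3-d array even when the test set holds a single value of `c`.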