Influence of additive constant on model prediction ability

ailinweili commented 7 years ago

The logical parameter "add" of pco-based gam and pco-based FDboost determines wether an additive constant should be added to distance matrix for euclidean representation. So I want to check the influence of additive constant on model prediction ability.

In the data application of Phillip Reiss' paper, add parameter is turned off for toydata application. For signature verification application, due to the fact that the code provided by Phillip can not be run successfully without error, I am not sure if the add parameter is turned on.

So I write simple code to compare the Influence of additive constant on pco-based gam and on pct-based FDboost. Two dataset is used, one for regression and another for classification. The ACC_add_medgf plot shows the CV accuracy of 8 models on MedicalImages dataset in a box plot. (For simplicity, I convert the multinomial response to binomial response.) The name of model consists of 3 parts, e.g "gam_euc_T" stands for pco-based gam model, with euclidean distance type, "add" equals to TRUE, "FDboost_dtw_F" represents pco-based FDboost model, with dtw distance type, "add" equals to FALSE. The MSE_add_toydata plot shows the CV mse of 4 models on toydata used by Phillip Reiss in a box plot.

ACC_add_medgf.pdf MSE_add_toydata.pdf

The results show that turn-on of "add" parameter leads to similar prediction performance of pco-based gam model, but WORSE performance of pco-based FDboost model!!!

library(foreign)
library(devtools)
source(file = "cvsource.R") # available at https://github.com/ailinweili/FDboost/tree/bfpco/cv
install_git("git://github.com/ailinweili/FDboost.git", branch = "bfpco")
library(FDboost)

require(poridge)   # available at https://github.com/dill/poridge
require(dtw)
require(fields); require(viridis)   # for color figures

# write function to perfrom cv for gam model
CVgampco <- function(cvdata, k = 5, distType, add = TRUE, family = gaussian, acc = FALSE, mse = TRUE,...){
  if(mse) MSE = vector("list",length = length(cvdata$cvtrain))
  if(acc) ACC = vector("list",length = length(cvdata$cvtrain))
  for(i in 1:length(cvdata$cvtrain)){
    respone = cvdata$cvtrain[[i]]$response
    dummy = rep(1, length(cvdata$cvtrain[[i]]$response))
    D <- dist(cvdata$cvtrain[[i]]$func_x, method = distType, ...)
    m = gam(respone ~ s(dummy, bs="pco", k = k , xt=list(D=D, fastcmd=FALSE, add = add)), method="REML", family = family)

    D_new <- dist(rbind(cvdata$cvtrain[[i]]$func_x, cvdata$cvtest[[i]]$func_x), method = distType, ...)
    n = nrow(cvdata$cvtrain[[i]]$func_x)
    nnew =  nrow(cvdata$cvtest[[i]]$func_x)
    D_new = as.matrix(D_new)[1:n, (n+1):(n+nnew)]
    dist_list<- list(dummy = D_new)

    pred_data <- pco_predict_preprocess(m, newdata = NULL, dist_list)
    if(mse == TRUE){
      p <- predict(m, pred_data)
      MSE[[i]] = sum((p - cvdata$cvtest[[i]]$response)^2)/length(p)
    }
    if(acc == TRUE){
      p <- predict(m, pred_data, type = "response")
      pclass <- ifelse(p > 0.5, TRUE, FALSE)
      ACC[[i]] = sum(pclass == cvdata$cvtest[[i]]$response)/length(pclass)
    }
  }
  return(list(MSE = if(mse) MSE, ACC = if(acc) ACC))
}

# Generate toy data
Xnl = matrix(0,30,101)
set.seed(813)
tau = sort(runif(30, .04, .96))
for (i in 1:30) {
  last.neg = ceiling(100*tau[i]); first.pos = last.neg+1
  Xnl[i, (last.neg-4):(last.neg)]=-1
  Xnl[i, (first.pos):(first.pos+4)]=1
}
X.toy = Xnl + matrix(rnorm(30*101,sd=.05),30)
y.toy = tau + rnorm(30, sd=.01)
y.rainbow = rainbow(30, end=0.9)[(y.toy-min(y.toy))/diff(range(y.toy))*29+1]
toydata <- list(y.toy = y.toy, X.toy = X.toy, s = 1:101)
mydata <- list(response = y.toy, func_x = X.toy, x_index = 1:101)
cvdata <- CVdata(mydata, nfold = 10, splitvclass = FALSE, response = "response", nosplitvars = "x_index")

# comopute MSE
MSE1 <- CVgampco(cvdata, distType = "dtw", window.type = "sakoechiba", window.size = 5, add = TRUE)
MSE1_m = mean(unlist(MSE1)) # 0.0004

MSE2 <- CVgampco(cvdata, distType = "dtw", window.type = "sakoechiba", window.size = 5, add = FALSE) #0.00044
MSE2_m = mean(unlist(MSE2)) #0.011

cvpartargs <- list(funname = "FDboost",cvdata = cvdata,mstop = 172, smstop = FALSE, 
                   mstop_grid = c(100,1000,5000,10000), B = 3, 
                   response = "response", ACC = FALSE, MSE = TRUE)

res3 <- do.call(CVmodel, args = c(cvpartargs, list(mdlargs = list(
  formula = formula(response ~ bfpco(func_x, s = x_index,df = 4, pve = 0.95, penalty = "identity", add = TRUE,
                                     distType = "dtw",window.type="sakoechiba", window.size=5)),
  timeformula = formula(~bols(1)), family = Gaussian()))))
MSE3_m = mean(res3$mse) # 0.08407

res4 <- do.call(CVmodel, args = c(cvpartargs, list(mdlargs = list(
  formula = formula(response ~ bfpco(func_x, s = x_index,df = 4, pve = 0.95, penalty = "identity", add = FALSE,
                                     distType = "dtw",window.type="sakoechiba", window.size=5)),
  timeformula = formula(~bols(1)), family = Gaussian()))))
MSE4_m = mean(res3$mse) #0.08407

# MSE boxplot
MSE_list <- list(gam_dtw_T = unlist(MSE1), gam_dtw_F = unlist(MSE2), FDboost_dtw_T = unlist(res3$mse), FDboost_dtw_F = unlist(res4$mse)) 
boxplot(MSE_list, main = "MSE of toydata")

## MedicalImages
# preprocess dataset MedicalImages 
# data available at http://www.timeseriesclassification.com/description.php?Dataset=MedicalImages
medgf <- read.arff("MedicalImages_TRAIN.arff") 
mymedgf <- medgf[1:200,]
mydata2 <- list(response = factor(mymedgf$target == 10), func_x = as.matrix(mymedgf[,-100]), x_index = as.numeric(1:99))

set.seed(8384)
cvdata <- CVdata(mydata2, nfold = 3, frac = 0.8, splitvclass = TRUE, 
                 response = "response", nosplitvars =  c("x_index"))

# compute accuracy
med_res1 <- CVgampco(cvdata, distType = "Euclidean", k = 5, family = binomial, add = TRUE, mse = FALSE, acc = TRUE)
ACC_m1 <- mean(unlist(med_res1$ACC)) #0.675
med_res2 <- CVgampco(cvdata, distType = "Euclidean", k = 5, family = binomial, add = FALSE, mse = FALSE, acc = TRUE)
ACC_m2 <- mean(unlist(med_res2$ACC)) #0.675
med_res3 <- CVgampco(cvdata, distType = "dtw", k = 5, family = binomial, add = TRUE, mse = FALSE, acc = TRUE)
ACC_m3 <- mean(unlist(med_res3$ACC)) #0.641
med_res4 <- CVgampco(cvdata, distType = "dtw", k = 5, family = binomial, add = FALSE, mse = FALSE, acc = TRUE)
ACC_m4 <- mean(unlist(med_res4$ACC)) #0.7

cvpartargs5 <- list(funname = "FDboost",cvdata = cvdata, mstop = 27, smstop = FALSE, B = 3, 
                   response = "response", ACC = TRUE, MSE = FALSE)

med_res5 <- do.call(CVmodel, args = c(cvpartargs5, list(mdlargs = list(
  formula = formula(response ~ bfpco(func_x, s = x_index,df = 4, pve = 0.95, penalty = "identity",add = TRUE,
                                     distType = "Euclidean")),
  timeformula = formula(~ bols(1)), family = Binomial()))))
ACC_m5 <- mean(unlist(med_res5$accuracy)) #0.725

cvpartargs6 <- cvpartargs5; cvpartargs6$mstop <- 79
med_res6 <- do.call(CVmodel, args = c(cvpartargs6, list(mdlargs = list(
  formula = formula(response ~ bfpco(func_x, s = x_index,df = 4, pve = 0.95, penalty = "identity",add = FALSE,
                                     distType = "Euclidean")),
  timeformula = formula(~ bols(1)), family = Binomial()))))
ACC_m6 <- mean(unlist(med_res6$accuracy)) #0.733

cvpartargs7 <- cvpartargs5; cvpartargs7$mstop <- 99
med_res7 <- do.call(CVmodel, args = c(cvpartargs6, list(mdlargs = list(
  formula = formula(response ~ bfpco(func_x, s = x_index,df = 4, pve = 0.95, penalty = "identity",add = TRUE,
                                     distType = "dtw")),
  timeformula = formula(~ bols(1)), family = Binomial()))))
ACC_m7 <- mean(unlist(med_res7$accuracy)) #0.45

cvpartargs8 <- cvpartargs5; cvpartargs8$mstop <- 77
med_res8 <- do.call(CVmodel, args = c(cvpartargs6, list(mdlargs = list(
  formula = formula(response ~ bfpco(func_x, s = x_index,df = 4, pve = 0.95, penalty = "identity",add = FALSE,
                                     distType = "dtw")),
  timeformula = formula(~ bols(1)), family = Binomial()))))
ACC_m8 <- mean(unlist(med_res8$accuracy)) #0.45
ACC_m8 # 0.766

# ACC boxplot
ACC_list <- list(gam_euc_T = unlist(med_res1$ACC), 
                 gam_euc_F = unlist(med_res2$ACC),
                 gam_dtw_T = unlist(med_res3$ACC),
                 gam_dtw_F = unlist(med_res4$ACC),
                 FDboost_euc_T = unlist(med_res5$accuracy),
                 FDboost_euc_F = unlist(med_res6$accuracy),
                 FDboost_dtw_T = unlist(med_res7$accuracy),
                 FDboost_dtw_F = unlist(med_res8$accuracy))
boxplot(ACC_list, main = "accuracy of MedicalImages data", las = 2)

fabian-s commented 7 years ago

First of all it's pretty cool that your implementation works as well or better than David Miller's "gamwith bs = "pco"" stuff in refund!

BUT:

to compare implementations, you need to set as many of their hyperparameters to identical values as possible (....duh :wink:). Specifically, you use k = 5 for the bs = "pco" stuff, but your algorithm, with pve = .95 uses 11-13 FPCoS. If I set k=12 for bs = "pco", the difference caused by add becomes larger for `bs = "pco" as well:

The first explanation for this behavior that comes to mind is that the way your predict function handles add = TRUE might not be entirely correct...have you doubledoublechecked? Other than that, I'm rather clueless, sorry.

ailinweili commented 7 years ago

I have checked the whole code, here is my founding.

The code I implemented in FDboost computes the same pco for new data as computed by pro_predict_preprocess function of refund package, if parameter "fastcmd" of pct-gam is turned on. When "fastcmd" is set to be FASLE, "cmdscale" function instead of "cmdscale_lanczos" function is used to conduct eigen-decomposition. In FDboost pkg, there is not "fastcmd" parameter, only cmdscale_lanczos is used.
The code written in this issue for pco-gam sets "fastcmd" parameter to be FASLE(default), which is also the case in the code Phllip used for his paper. If "fastcmd" is set to be TRUE, the plot for npc = 5 shows that the difference of pco-gam model performance based on different "add" value is greater than that of fpco-FDboost :
So, should I also enable "cmdscale" in FDboost? e.g. use new parameter fastcmd @fabian-s

fabian-s commented 7 years ago

Aaah, excellent! You found a reasonable sounding reason..... :+1:

Yes, let's use cmdscale as the default and implement a fastcmd-switch for using cmdscale_lanczos if the data are too big for cmdscale. Let's also set add = FALSE as the default since performance seems to be mostly better without it...

ailinweili / FDboost

Influence of additive constant on model prediction ability #8